
How Uber Manages a Million Writes per Second - danjoc
http://highscalability.com/blog/2016/9/28/how-uber-manages-a-million-writes-per-second-using-mesos-and.html
======
peterwwillis
tl;dr 20 clusters, one of which (the one with a million writes) records only
GPS data, running an application in a container in an API-driven OS
virtualization package

--

I like the part where 10% performance overhead is not a big deal, while at the
same time they make a big deal about how Mesos _can_ save you 30% performance
overhead.

And notice how at first they had a group of clusters for completely separate
things like API, storage, etc., but now with Mesos they program against an
entire datacenter by using a different specific API for separate things like
APIs, storage, etc. Much... better.

Also now with Mesos they can run different kinds of services on the same
machine. Kind of like containers. Kind of like separate applications on any
kind of machine at all.

 _" If, for example, one services uses a lot of CPU it matches well with a
service that uses a lot of storage or memory, then these two services can be
efficiently run on the same server. Machine utilization goes up."_

Because if I have a machine pegging 90% CPU, I totally want an I/O-heavy job
running on that machine, because interrupts are, like, imaginary, man. Yeah,
I'm sure the machine's utilization goes up... right up to triple-digit load.

 _" Agility is more important than performance."_

So the ability to automate rolling out a virtual machine and application in as
abstracted a way as possible is more important than their 1-million-writes-
per-second goal. I think I know why they don't find 10% loss to be a problem
now... they can just keep spinning up machines until they reach their
performance goal. 150 machines per DC to store ride-sharing data seems a bit
much, but maybe there's a lot more going on behind the scenes than the few
features on their app.

~~~
icebraining
 _Because if I have a machine pegging 90% CPU, I totally want an I/O-heavy job
running on that machine, because interrupts are, like, imaginary, man. Yeah,
I'm sure the machine's utilization goes up... right up to triple-digit load._

Honest question: can't you dedicate some cores to handling the interrupts,
leaving the others for the other processes?

~~~
user5994461
As a matter of fact, yes... that's a "common" but very advanced optimization.

I've mostly seen it for load balancers. Pin the network card interrupt on one
core and the load balancer on another.
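
A minimal sketch of what that pinning can look like on Linux (the IRQ number
42 and the core choices here are hypothetical - look up the NIC's real IRQ in
/proc/interrupts - and this needs root):

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;

    // Sketch: dedicate core 0 to servicing the NIC's interrupt by writing
    // a CPU bitmask to the IRQ's smp_affinity file. The IRQ number is
    // made up; check /proc/interrupts for yours.
    public class PinIrq {
        public static void main(String[] args) throws IOException {
            int irq = 42;      // hypothetical NIC interrupt
            int core = 0;      // core reserved for interrupt handling

            // smp_affinity takes a hex bitmask of allowed CPUs: 1 << core
            String mask = Integer.toHexString(1 << core);
            Files.writeString(Path.of("/proc/irq/" + irq + "/smp_affinity"), mask);
            System.out.println("IRQ " + irq + " pinned to core " + core);
        }
    }

The load balancer itself would then be started on the remaining cores, e.g.
with `taskset -c 1-3 ./load_balancer`.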

~~~
sitkack
Or turn interrupts completely off and poll. If you always have work in the
queue, polling is less intensive, not more.
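
To illustrate the control flow (not a real NIC driver - kernel-bypass
frameworks like DPDK do this in C), here's a toy busy-poll loop:

    import java.util.concurrent.ConcurrentLinkedQueue;

    // Toy busy-poll loop: the core spins on the queue instead of waiting
    // for an interrupt. Under constant load the queue is never empty, so
    // no cycles go to interrupt handlers or context switches - but the
    // core does run at 100% by design.
    public class BusyPoll {
        static final ConcurrentLinkedQueue<byte[]> rxQueue = new ConcurrentLinkedQueue<>();

        public static void main(String[] args) {
            while (!Thread.currentThread().isInterrupted()) {
                byte[] packet = rxQueue.poll();   // non-blocking check
                if (packet != null) {
                    handle(packet);               // under load, work is always there
                }
                // no sleep/park: minimal latency at the cost of a pegged core
            }
        }

        static void handle(byte[] packet) { /* process the packet */ }
    }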

~~~
user5994461
That makes no sense. Polling means the core is running at 100% all the time.

~~~
ddorian43
Yes, but it's actually faster. Source: ScyllaDB.

~~~
user5994461
Still doesn't make sense in the context of a network card. There must be an
interrupt to pre-empt the system or there's gonna be packet loss.

~~~
diroussel
No, you just need to service the work before the buffer overflows.

------
stubish
I don't follow the 'Why run Cassandra in a container and not just on the whole
machine?' arguments. The first two seem to argue against the concept (because
you need to coordinate what machine the container is running on), and the
third talks about something completely different (it's about a single cluster
vs. multiple clusters, nothing to do with containers).

I'm curious how they can start up one new node in a cluster per minute, when
last I saw, Cassandra required 2 minutes after bringing a new node online for
things to settle before starting a new one. It's a pain point for me when
running tests that need to start up and tear down Cassandra clusters, and I'd
love to know when and how it can be avoided.

(Maybe I need to watch the actual talk)

~~~
stonemetal
I haven't worked with Cassandra much, but I know the Elasticsearch guys
suggest running multiple nodes per box if your hardware is too big. They claim
that the JVM works better if you keep its heap size under 32GB. Since
Cassandra also runs on the JVM, maybe similar logic applies to it? Later on he
mentions each Cassandra process gets a 32GB heap.

~~~
user5994461
The 32GB figure is a global guideline for ALL JVM applications.

Under 32GB, the JVM uses 32-bit pointers. [Actually slightly under 32GB.]

Over 32GB, the JVM uses 64-bit pointers.

Pointers consume a lot of space in a running program. Basically, a 30GB heap
(with 32-bit pointers) holds about as much as a 40GB heap (with 64-bit
pointers).

That means there is a chasm between a 30GB and a 40GB heap where it makes no
sense to run a server: the additional space is wasted on the bigger pointers.

Second, 30GB is a sweet spot. It's simple to justify and enough for most
applications. (A bigger heap will give longer GC pauses and other side
effects; most people don't need to get into that kind of optimization.)
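
A quick way to see this for yourself on a HotSpot JVM (the diagnostic bean is
HotSpot-specific, and the class name here is just for illustration):

    import com.sun.management.HotSpotDiagnosticMXBean;
    import java.lang.management.ManagementFactory;

    // Run with `java -Xmx30g CheckOops` and again with `-Xmx40g`: HotSpot
    // silently switches from compressed 32-bit "oops" to full 64-bit
    // pointers once the max heap crosses (roughly) 32GB.
    public class CheckOops {
        public static void main(String[] args) {
            HotSpotDiagnosticMXBean bean =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
            System.out.println("UseCompressedOops = "
                + bean.getVMOption("UseCompressedOops").getValue());
            System.out.println("Max heap (MB) = "
                + Runtime.getRuntime().maxMemory() / (1024 * 1024));
        }
    }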

------
rak00n
> For their largest clusters they are able to support more than a million
> writes/sec and ~100k reads/sec.

Why are reads 10x more expensive than writes?

~~~
saurik
This is extremely common and is normally caused by an optimization for
streaming writes that leaves data relatively disorganized, in which case you
need to perform numerous reads to find data which only required a single write
to store.

~~~
user5994461
On the same topic - dunno if it applies to Cassandra.

I've seen some databases work in "append only" mode. They write new data to
the end of the file. They never erase existing data.

It's generally a very efficient write pattern (even on good old spinning
drives) and it allows writes to always happen in batches.

On the flip side, reads are expensive: they require "finding" stuff in various
places, reading it, verifying it, and [if configured] repeating that on
multiple nodes to compare the values.
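
As a rough sketch of the write side (a toy, not any particular database's
format), an append-only log makes every batch a single sequential write:

    import java.io.*;
    import java.nio.charset.StandardCharsets;
    import java.util.List;

    // Toy append-only log: records only ever go to the end of the file,
    // so even a spinning disk sees pure sequential writes, and a whole
    // batch costs a single flush.
    public class AppendOnlyLog implements Closeable {
        private final BufferedOutputStream out;

        public AppendOnlyLog(File file) throws IOException {
            out = new BufferedOutputStream(new FileOutputStream(file, true)); // append mode
        }

        public void appendBatch(List<String> records) throws IOException {
            for (String r : records) {
                out.write((r + "\n").getBytes(StandardCharsets.UTF_8));
            }
            out.flush(); // one sequential write for the whole batch
        }

        @Override public void close() throws IOException { out.close(); }
    }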

~~~
snuxoll
Cassandra has an append-only log file during operation, but unlike in an RDBMS
it's not just used for transaction replay, so you're half right. Cassandra
periodically compacts the log and writes SSTables to disk, but newer data and
tombstones are stored in the log for a while, which, as you surmised, does
have a performance hit.

~~~
ddorian43
The performance hit is not from the log though, but from having writes that
don't read first (and TTLs). So you have to look at other versions on disk too
to find the latest value.

I think newer data and tombstones would stay for a while in SSTables, not in
the log.

------
Symbiote
Two writes from the driver and two from the passenger each minute is 0.0666...
writes per second per car.

To get 1M/s, we need 15M cars.

[1] says there are about 2M trips per day, so does this mean Uber records the
location of all users every 30 seconds, rather than only when they're waiting
for or in a car?

[1] [http://expandedramblings.com/index.php/uber-statistics/](http://expandedramblings.com/index.php/uber-statistics/)
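
The arithmetic behind the 15M figure, spelled out:

    // Back-of-the-envelope check of the figures above.
    public class BackOfEnvelope {
        public static void main(String[] args) {
            double perCarPerSec = 4.0 / 60.0;  // 2 driver + 2 passenger writes/minute
            double target = 1_000_000;         // writes per second
            System.out.printf("per-car rate: %.4f writes/s%n", perCarPerSec); // 0.0667
            System.out.printf("cars needed:  %.0f%n", target / perCarPerSec); // 15000000
        }
    }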

~~~
chronic6l
They record every second but it is not written continuously.

------
easytiger
So assuming 75 hosts per cluster, that's about 13,333 writes per host. How is
that impressive at all?

~~~
user5994461
Depends on the spec of the servers :D

If they all have 16 cores and 100GB RAM, it's really unimpressive and we just
found out why Uber is not profitable.

~~~
easytiger
No, if they are single-core with 1GB of RAM they should be able to write
15k/sec easily.

------
k_bx
> Existing Riak installations will be moved to Cassandra.

What's the reason behind this?

------
ebalit
Wouldn't dividing their database per city make the system a lot simpler?
Transfers between cities should be rare overall, and it would limit the
reliance on replication.

Is there something obvious that I don't see?

~~~
yazaddaruvala
What is a "city"? Where does it start, where does it stop?

What if some rider starts in city X and ends in city Y (New York to Hoboken)?
How do you elegantly handle that transaction? You would need consistent
transactions across two databases. Usually its not possible, or at the least
extremely inefficient, to implement transactions at the application layer -
especially in a distributed system under heavy load.

Also you end up with a lot more operational responsibility. If (when) nodes
fail, how many nodes failed? How many instances of Cassandra were affected? If
there were X node failures, was there data loss on any of the affected
instances? If there was data loss, how can we recover from a snapshot and
replay all recent data to that instance without replaying it to others?

Meanwhile, is the application layer really where you want to solve these
scaling issues? What happens if a single city becomes too large to handle - do
you partition again by neighborhood? If you're going to invest in anything,
invest in improving the database. Maybe create a specialized database to
handle your use case more generally. Ideally, you want the application layer
to only provide heuristics that the database can use to efficiently partition
its data.

~~~
ebalit
I think that Uber is organized per city, at least on the business side. I'm
talking about cities as used in [1]. I'm not even sure that you can do
inter-city rides using Uber.

1: [https://www.uber.com/cities/](https://www.uber.com/cities/)

------
noescape
I wonder if it would have been simpler for Uber to use RethinkDB instead of
this complicated scheme they set up.

------
ccvannorman
OT: So, Uber does a ton of shady, illegal things and then gets to continue on
with cool tech posts on HN? Seems like the message is clear: If you're in
tech, do as much illegal shady shit as you can, because once you make money
your problems will all disappear and those you trampled will be forgotten.

------
Annatar
Cassandra is designed to be eventually consistent; that means that if a node
didn't yet get the update(s), a client will get stale data. This is basically
the same issue as with DNS.

If I define incomplete data as equal to corrupt data, which I do, I am
incapable of conceiving a scenario where corrupt data would be acceptable.

Why would one define it so? Because I'm convinced that computers should always
deliver correct data, and in my conviction, data can never be correct unless
it is also complete. Therefore, atomicity or bust; therefore, Cassandra =
false, and in such a case, a million writes per second is of no consolation.

~~~
malux85
At Uber scale (or Facebook scale, or Google scale, or any non-trivial scale)
you simply cannot say "atomicity or bust", because there isn't a distributed
database that can give you atomicity at 1M writes a second: that technology
simply doesn't exist at this price point, or it would mean a 10x increase in
server costs.

You DON'T need atomicity all the time. Sometimes you do (financial
transactions), but a lot of the time you don't:

Facebook likes. Do they need to be atomic? Does it matter if a web request
gets a stale "like count" and the number of likes for an item is 12,544
instead of 12,545 on someone's page view because the database isn't atomic?
No, it doesn't matter.

Uber GPS data. Uber writes GPS locations as your trip progresses - does it
matter that the last 1 or 2 locations might not be visible to the app or
external reports for a few seconds? No, it doesn't matter.

Ad-click data. Does it matter that the last hundred or two hundred impressions
are not visible in a report whose absolute values are in the millions or tens
of millions? No, it doesn't matter - even though this can represent money, the
value of a couple of hundred impressions missing from the report is chump
change, not even worth thinking about.

With the exception of financial transactions, some safety-critical machine
data, and a few other cases, atomicity is not required, and that's where these
distributed databases shine.

~~~
Annatar
 _You DON'T need atomicity all the time. Sometimes you do (financial
transactions)_

Bingo. No atomicity means bad money or no money, and I'd have a problem. I
also have this extreme dislike of the "garbage in, garbage out" principle, so
either what I'm delivering is correct, or I should be fired, because using
computers at that point is pointless. If we don't care about correctness or
completeness at all, then we just discounted two of the three reasons why
computers exist at all. That leaves us with speed, and at that point, what use
is it?

~~~
snowwrestler
It's easy to sit back on principle and point out all the ways that other
people's work fails to live up to your ideals.

It's a lot harder to go out and build something that actually delivers your
ideals.

If you think a distributed database should provide fully correct and complete
data from all nodes while hitting 1 million writes per second, go build one
that does it. If you succeed, not only will you make yourself happy, you will
make yourself a shitload of money.

~~~
Annatar
 _If you think a distributed database should provide fully correct and
complete data from all nodes while hitting 1 million writes per second, go
build one that does it._

I do that for a living. If I didn't, I would not have commented.

