
Distributed architecture concepts I learned when building a large payment system - gregdoesit
http://blog.pragmaticengineer.com/distributed-architecture-concepts-i-have-learned-while-building-payments-systems/
======
hardwaresofton
This post goes into a lot of concepts but doesn't really touch on anything
concrete.

Also the justification for the Actor model is flimsy compared to the other
concepts.

I'd love to see an actual diagram of the architecture that was built, and
maybe also why sharded + replicated Postgres (+/- Kafka, _maybe_) wasn't
enough to solve the data needs of this application.

Assuming they farm out the actual processing of payments (talking to the
credit card companies) to Stripe/PayPal/Braintree or whatever, all this
"payment system" is doing is taking API requests to make payments, calling out
to whatever processor they're using, saving the results, and probably
maintaining the state machine for every transaction (making sure that invalid
state transitions don't happen). It sounds like they broke up a monolith, but
it's completely unclear _how_ they made it better.
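
Concretely, the kind of per-transaction state machine I'm imagining is nothing exotic; a rough sketch (states and transitions are my guesses, not theirs):

    # Hypothetical payment state machine -- illustrative only.
    ALLOWED = {
        "created":    {"authorized", "failed"},
        "authorized": {"captured", "voided"},
        "captured":   {"settled", "refunded"},
    }

    def transition(current, target):
        """Reject invalid moves, e.g. settling a charge that was never captured."""
        if target not in ALLOWED.get(current, set()):
            raise ValueError(f"illegal transition {current} -> {target}")
        return target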

~~~
cornholio
Contrast this article with the buzzword collider we had last week about how
Movio saved $50K by moving to a Go-microservice-InfiniDB-neural-kubernetes-
something-something, which turned out to come down to a lack of basic
knowledge of how to use databases.

I found this piece interesting; it introduces some basic concepts in a
structured and easy-to-understand manner. Not many writers take the time to
develop such articles, so they are rare and thus useful as a reference to
point coworkers to, and for a quick recap.

The details of the precise technologies used will be irrelevant in 3 years;
the fundamental concepts will not. It's not a recipe, it's a conceptual primer
for building an architecture that scales.

~~~
hardwaresofton
I do remember that article -- this article is certainly better, but the
explanation in the actor section reeks of the same buzzword collider soup you
mentioned:

> Why did the actor model matter when building a large payments system? We
> were building a system with many engineers, a lot of them having distributed
> experience. We decided to follow a standard distributed model over coming up
> with distributed concepts ourselves, potentially reinventing the wheel.

Literally, the word "distributed" is just used repeatedly to get the point
across. While I might have a bit of an axe to grind with people who just throw
the actor model around as a way to fix things (or any paradigm for that
matter; functional programming doesn't solve every problem, and sometimes you
_should_ maybe just mutate some state and call it a day), I am well aware that
it is indeed a useful paradigm -- this was just a bad way to explain the
choice of using it. If he built the kind of system I think he did, I'm not
sure the actor model even confers much of a benefit, and I was looking to that
explanation to drop some wisdom on me. It did not.

My problem is that this article is indeed _too_ basic: he goes to great
lengths to describe _why_ the architecture he built uses the
methodologies/approaches he's laid out, but then forgets to lay out the actual
architecture -- this post is all ivory tower and none of the rubber-hits-the-
road that you'd expect from a good technical piece. Things fall apart when the
rubber hits the road, and sharing how they fell apart enriches people who read
the article.

Most intermediate engineers working with distributed systems (in good
engineering orgs) have already run into these concepts. For those who may not
have heard of them, the article is very helpful, but for those who already
know them, it's basically a rehash of level 1.

For example, take any single chapter of the Google SRE handbook (e.g.
https://landing.google.com/sre/book/chapters/service-level-objectives.html);
his post reads like the first half of a chapter, not the complete thing. Of
course, the SRE handbook is a high bar to set for quality, but at least a
whiff of this production-grade payments system he built would have been nice.

------
fake-name
Kyle Kingsbury/Aphyr's Jepsen distributed DB testing writeups are also a
really, really wonderful base for comprehending some of the failings of
distributed systems.

[https://aphyr.com/tags/Jepsen](https://aphyr.com/tags/Jepsen)
[https://jepsen.io/analyses](https://jepsen.io/analyses)

~~~
hardwaresofton
Those write-ups are, bar none, the best I've seen on distributed databases. If
I had to list the things you can learn from them:

- Efficient (and intelligent) testing of distributed systems (check out the
Clojure libraries he uses)

- Effective Clojure

- An introduction (and re-introduction, which is IMO one of the best ways to
learn) to distributed system guarantees (linearizability, serializability, and
weaker guarantees), in the context of database system guarantees

- How to write good, information-dense blog posts

- How to deftly (and openly) navigate the ethics of paid testing of a product
by a corporate entity

- How to think critically about distributed database hype trains

Also, a plug for RethinkDB, the document store that got it just about right
the first time he tested it
(https://aphyr.com/posts/329-jepsen-rethinkdb-2-1-5) and again when
re-configured (https://aphyr.com/posts/330-jepsen-rethinkdb-2-2-3-reconfiguration).

~~~
y4mi
I also like them, especially how well he documents his findings and their
implications.

Before anyone dismisses one of the databases he tested, I'd however like to
caution against trusting his work fully at this point. Some of the documents
are 5+ years old; that's a lot of time for the database developers to resolve
issues.

Thankfully, the tests are all replicable, so if you're evaluating Cassandra,
for example, please repeat his tests before dismissing the database.

------
est
> especially around resharding. Foursquare had a 17 hour downtime in 2010 due
> to hitting a sharding edge case

so I opened the link:

http://highscalability.com/blog/2010/10/15/troubles-with-sharding-what-can-we-learn-from-the-foursquare.html

> What Happened? The problem went something like: Foursquare uses MongoDB to
> store user data on two EC2 nodes, each of which has 66GB of RAM

There's your problem. Using Mongo sharding in production in 2010.

~~~
dajonker
Nice example. I have personally seen Hadoop jobs which would only scale by
using larger VM instance types.

~~~
maxnevermind
What are Hadoop jobs? MapReduce? Does someone still use them, and why?

~~~
federicoponzi
Why not? Using commodity servers to run long-running tasks doesn't seem like a
trend that's fading away anytime soon.

There are more options than M/R now, though (Spark, Presto, etc.), but they
call for high-performance servers (Presto's suggested RAM amount is 128GB).

~~~
maxnevermind
Can you name a single type of job that MapReduce can do better (faster / using
fewer resources) than Spark? In my experience, even for the simplest tasks --
like when you just need to read the data, change it slightly without
shuffling, and write it back -- Spark is faster than MapReduce with the same
resource limits, and it's much more efficient for heavy jobs with joins etc.
And of course there are other things, like the API, which in MapReduce is just
a nightmare to deal with compared to Spark.
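
To make "simplest tasks" concrete, this is roughly what I mean by a read-it, tweak-it, write-it-back job in (Py)Spark -- paths and column names are made up:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("simple-etl").getOrCreate()

    # Read, adjust one column (no shuffle), write back.
    df = spark.read.parquet("hdfs:///data/events/2018-04-01")
    df = df.withColumn("amount_usd", F.col("amount_cents") / 100)
    df.write.mode("overwrite").parquet("hdfs:///data/events_usd/2018-04-01")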

------
peterwwillis
FWIW, if your system scales horizontally, it also has to scale vertically. You
cannot simply add nodes to a cluster into infinity. You often can't even do 5x
the number of nodes before performance and availability start to degrade. The
larger the number of nodes, the more difficult it is to manage, and the more
chances for failure.

The author writes that they didn't think a mainframe could handle the load in
the future. But even super old mainframes still handle their load 40+ years
on. They just had better design parameters to make sure they were used
appropriately, and they knew how to keep their requirements compact. You may
only have 8 alphanumeric characters to store a customer's record name in, but
you're never going to have more than 2.8 trillion customers. That hard limit
also means you know how large the data will get, which makes capacity planning
easier. (Even in the cloud you must do capacity planning, if for nothing else
than cost projections.)
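
(The 2.8 trillion figure falls straight out of the encoding, assuming case-insensitive names:)

    # 8 alphanumeric characters, each one of A-Z or 0-9:
    print(36 ** 8)   # 2,821,109,907,456 -- the ~2.8 trillion hard ceiling on record names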

~~~
maxnevermind
I'm not that experienced in distributed systems, just interested in the topic.
"You often can't even do 5x the number of nodes before performance and
availability start to degrade." Why? Can you give an example? If I have 50
Cassandra nodes, will scaling to 250 be a problem, or do you mean there will
be issues at the application level as a whole during that scaling?

~~~
merb
You probably won't write your data to all 250 nodes... assuming you have a
distributed database, you do not write to more than 3-9 nodes (and even 7-9 is
way too much in most cases).

If you wrote to all nodes, your performance would be terrible. I also do not
think that they needed more than one master in their payment system. I mean,
it's highly unlikely that a big machine couldn't handle all their
transactions. If it were still a problem, I would use an old-school RDBMS and
shard the hell out of it.

However, keep in mind that Uber did a lot of weird things in their past, from
an engineering perspective.
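
To put rough numbers on it (illustrative, not any particular database's defaults):

    # Illustrative quorum arithmetic -- not tied to a specific database.
    replication_factor = 3
    write_quorum = replication_factor // 2 + 1   # 2 of 3 replicas must ack

    # Even on a 250-node cluster, a single write only touches those 3 replicas;
    # adding nodes spreads keys around, it does not widen individual writes.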

~~~
peterwwillis
Well, ideally you do actually want to spread writes across all the nodes
evenly, but in practice this rarely works as expected, and you do get uneven
loading across the cluster. That's only one of many reasons why sharding is
crap.

Re: scaling Cassandra, I haven't experienced it personally, but I've heard
horror stories regarding replication and load balancing when nodes start to go
down. With other systems, you see things like uneven loading, running out of
storage or compute, limits on IOPS or network bandwidth, and literally the
probability of overall failures increasing because math. It's also just harder
in general to manage 250 nodes rather than 50. Running commands takes longer,
you need bigger infrastructure, you start running into weird limits like what
your original subnet sizes were set to, etc. There are myriad variables that
just become more difficult the bigger things get.
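
The "because math" part is easy to put numbers on; assume, purely for illustration, that each node independently has a 1% chance of failing in some window:

    # Chance of at least one node failure, assuming 1% per node per window.
    p = 0.01
    for n in (50, 250):
        print(n, round(1 - (1 - p) ** n, 2))
    # 50  -> 0.39  (some failure ~39% of the time)
    # 250 -> 0.92  (some failure ~92% of the time)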

Now, compare this to three giant-ass nodes. You start getting close to
capacity, and maybe you add a fourth, and that lasts you another year. So.
Much. Freaking. Easier. AND more reliable. There's no contest - if you can
scale mostly-vertically, do it. (Unless you're Google or Facebook and have a
few billion dollars to spend on engineering, in which case this actually _is_
scaling mostly-vertically, it's just datacenters instead of nodes)

~~~
wsy
It really depends on your application.

Access services (e.g. browsing and searching a product catalog) scale
horizontally without any limitations. For this type of application, 4 huge
servers are more expensive, less reliable, and not significantly easier.

On the other hand, services around scarce resources (e.g. a last-minute travel
booking application) are indeed very hard to scale horizontally, so in such
cases the parent's advice is sound.

------
epberry
Great overview with lots of links. I came across quite a few concepts I barely
know - or know them in a different context. I did trip hard over this
sentence:

> Distributed systems often have to store a lot more data, than a single node
> can do so.

~~~
spooneybarger
Did you figure it out? If not: from what I read, I assume the author is
referring to many systems keeping replicas of data on different members of the
cluster. By keeping replicas, you increase the amount of data stored but can
survive the failure of any given node.

~~~
mandeepj
Microsoft's Service Fabric is an open source distributed system built on the
same principles. Check it out:
https://azure.microsoft.com/en-us/services/service-fabric/

~~~
polskibus
Is it really open source? Do you have a link?

~~~
davegardner
Yes it is: https://github.com/Microsoft/service-fabric

~~~
polskibus
That's recent! Great to see it! However, the Windows build tools are not ready
yet as per the readme, and there have been no releases since open-sourcing.

------
inoop
Out of curiosity, have you considered using a workflow management system such
as AWS Step Functions [1] instead of a set of queues? If so, what made you
decide to go with queues?

Workflow engines neatly solve a lot of the challenges you mention in the
article essentially for free:

* Model your problem as a directed graph of (small) dependent operations. Adding new steps does not require adding new queues, just new code (see the sketch below).

* Let the workflow engine manage the state of your transactions (the important, strongly consistent part) and just focus on implementing the logic. Scale horizontally by adding more worker nodes.

* Get free error-handling tools, such as automatic retries on failed steps, support for special error-handling steps, and metrics you can alarm on (e.g. number of open workflows, number of errored workflows, etc.).

* Get a free dashboard where you can look up the state of a given transaction, look at error messages, (mass) re-drive failed transactions (e.g. after an outage or after having fixed a bug), etc.

[1] https://aws.amazon.com/step-functions
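
For anyone who hasn't seen one, a minimal Step Functions definition is just that directed graph expressed as states, with retries attached; a rough boto3 sketch (step names, Lambda ARNs, and the role are made up):

    import json
    import boto3

    # Minimal sketch of a two-step payment workflow in Amazon States Language.
    definition = {
        "StartAt": "AuthorizeCharge",
        "States": {
            "AuthorizeCharge": {
                "Type": "Task",
                "Resource": "arn:aws:lambda:us-east-1:123456789012:function:authorize",
                "Retry": [{"ErrorEquals": ["States.ALL"], "MaxAttempts": 3}],
                "Next": "RecordLedgerEntry",
            },
            "RecordLedgerEntry": {
                "Type": "Task",
                "Resource": "arn:aws:lambda:us-east-1:123456789012:function:record",
                "End": True,
            },
        },
    }

    sfn = boto3.client("stepfunctions")
    sfn.create_state_machine(
        name="payment-workflow",
        definition=json.dumps(definition),
        roleArn="arn:aws:iam::123456789012:role/sfn-exec",  # placeholder role
    )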

~~~
bm1362
Uber recently open sourced Cadence [1], which is similar to AWS SWF and built
by the same principal engineer. A good overview was presented at Data @ Scale
[2] if you're interested. Under the covers, it manages a set of queues for you
and lets you write procedural business logic in a resilient way as an implicit
workflow.

[1] https://github.com/uber/cadence

[2] https://atscaleconference.com/videos/cadence-microservice-architecture-beyond-requestreply/

------
simplify
I love how they mentioned Stack Overflow successfully scaled vertically.
Inspirational.

~~~
contingencies
Most sites can do this - e.g. I believe HN is scaled vertically. Over-
engineering is rife within the industry!

_Design up front for reuse is, in essence, premature optimization._ -
@AnimalMuppet

_Pike's 2nd Rule: Measure. Don't tune for speed until you've measured, and
even then don't unless one part of the code overwhelms the rest._ - Rob Pike,
Notes on C Programming (1989)

... quotes from http://github.com/globalcitizen/taoup

~~~
bradhe
Agreed. If you understand the resource requirements of your environment, then
vertical scaling is totally viable. I've seen lots of scenarios where teams
shoot themselves in the foot because they fail to just think critically for a
second about their deployments.

Consider a Redis-based cache: for future-proofing, you could introduce a
sharding scheme that deterministically allocates keys based on cluster size.
That would be operationally pretty complex, not to mention it also introduces
complexity in edge-case scenarios (e.g. while your cluster size is transitioning).

Alternatively, you could do a little back-of-the-napkin math to figure out how
large your keyspace is realistically going to be and just... allocate a
machine that fits your needs up front.
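
A sketch of both options, with made-up numbers:

    import zlib

    # Option 1: deterministic key -> shard mapping. Simple, but when num_shards
    # changes, most keys map somewhere new -- the transition edge case above.
    def shard_for(key: str, num_shards: int) -> int:
        return zlib.crc32(key.encode()) % num_shards

    # Option 2: napkin math first. 50M keys at ~1 KB each is ~50 GB,
    # which fits comfortably in one suitably sized box.
    print(50_000_000 * 1_000 / 1e9, "GB")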

Obviously we could make arguments about budget and all that, but those are
optimizations that are better to defer.

~~~
greenleafjacob
> Consider a Redis-based cache: for future-proofing, you could introduce a
> sharding scheme that deterministically allocates keys based on cluster size.
> That would be operationally pretty complex, not to mention it also introduces
> complexity in edge-case scenarios (e.g. while your cluster size is transitioning).

This is implemented in Twemproxy, and resizing the cluster is a known edge
case, for which Twitter developed Nighthawk (centralized shard placement).

------
segmondy
I've built a payment system before; I didn't make it distributed, and if I
were to build one again, I wouldn't make it distributed. I like my payment
systems old-school batch style. Correctness is very important to me, and the
ability to roll back and replay things if something goes wrong is very
important.

~~~
jacquesm
Batch vs. realtime, correctness, rollback & replay have nothing to do with
whether or not you build your system distributed.

In fact, a distributed system can have some advantages over a monolith in an
application like that if done properly; for instance, you could set up a
central message log that keeps data for a period of x, allowing you to 'roll
back and replay' to your heart's content.
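
A toy version of that 'central message log' idea, just to make it concrete (not any particular product):

    from collections import defaultdict

    log = []   # in practice a durable, replicated log (Kafka, a WAL, etc.)

    def record(event):
        # e.g. {"from": "rider:1", "to": "driver:9", "cents": 1250}
        log.append(event)

    def replay(upto=None):
        """Rebuild every balance from scratch by re-applying the log."""
        balances = defaultdict(int)
        for e in log[:upto]:
            balances[e["from"]] -= e["cents"]
            balances[e["to"]]   += e["cents"]
        return balances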

------
matchagaucho
I wish the author had distinguished _which_ payment system. There are
_Consumer_ payments that all Riders are familiar with, and then there are
_Driver_ payments made at the end of the day.

~~~
Johnie
Traditionally, Rider and Driver payment systems were two separate systems.
This caused a lot of complications as the business grew more complex.
Traditionally, Riders paid Uber and Uber paid out to Drivers. However, with
cash payments, the payment system also required Drivers to collect cash and
pay Uber. Then with Eats, Restaurants were neither Driver nor Rider.

The new payment system unified all of these into pure money movement where any
entity can move money to any other entity. This enables all current and future
products to have flexibility in payment models. (Think about how payments
would be modeled for Uber Freight or with the new Uber Transit partnership
[1])

Source: I work on the Uber Payment System

[1] http://www.businessinsider.com/how-uber-masabi-mobile-train-tickets-works-2018-4
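
To make "pure money movement" concrete, the shape of it (a simplified illustration, not our actual schema) is just double-entry rows between arbitrary entities:

    # Any entity can move money to any other entity -- illustrative only.
    def transfer(ledger, debit, credit, cents, ref):
        ledger.append({"account": debit,  "delta": -cents, "ref": ref})
        ledger.append({"account": credit, "delta": +cents, "ref": ref})

    ledger = []
    transfer(ledger, "rider:42",     "uber:receivable", 1500, "trip-abc")   # card charge
    transfer(ledger, "uber:payable", "driver:7",        1125, "trip-abc")   # driver payout
    transfer(ledger, "eater:3",      "restaurant:88",    899, "order-xyz")  # Eats, same model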

~~~
icebraining
Is there anything special about it, or is it just a regular ledger?

~~~
Johnie
At the end of the day, much of any payment system is just a ledger. Payment
systems have to be provably correct, reliable, and available. Many payment
systems make tradeoffs between these. Most existing payment systems are
batch-based.

What we have built at Uber is a realtime system that is reliable and provably
correct, with high availability.

As I've said in another comment, for many other systems, if it fails, people
can just try again later. If we fail, someone's mother, grandfather, son or
daughter is stuck on the side of the road in the middle of the night
somewhere. Our needs (or at least the standard we hold ourselves to) are
higher than those of many other payment systems.

------
stoicman
This article is golden, and in more ways than one; it's especially handy if
you are learning distributed systems this semester and the subject is not
properly covered.

Thank you!

------
susegado
CockroachDB is a distributed SQL database system that solves all the problems
mentioned in this article.

------
mozumder
I'll never understand why a payment system needs to scale horizontally.

Modern computer systems can scale to 500 or more processor cores. Each core
runs billions of instructions per second.

A system for a billion accounts on the scale of Uber probably has a million
active users at a time, probably a quarter of that involving payment.

Is Uber saying that they can't support 250k payment transactions per second on
the largest system today? That's maybe 1000 transactions per second per core
on the largest systems, or about 1ms per transaction. Why is that impossible
for them?

Or, to put it another way, why can't one transaction be completed in less than
1 million CPU instructions?

And that's for the very largest company, like Uber... I can't even imagine a
typical startup needing to scale horizontally for payment processing.

~~~
saryant
It all comes down to uptime.

One box can't be distributed across multiple racks in the data center to guard
against downtime if a switch crashes. Never mind that: one box can't be
deployed across multiple data centers. If you deploy to multiple DCs you can
fail over if one DC starts having issues.

Then there’s deploys. Do you canary your deploys? Deploy the next release to a
subset of production nodes, watch for regressions and let it ramp up from
there? Okay, I’ll give you that one, it could be done on one big box.

In any case, payments aren't CPU-intensive, but they're a prime case of hurry-
up-and-wait. Lots of network IO, so while you won't saturate the CPU with
millions of transactions on the same box, I could easily imagine saturating a
NIC. Deploying to shared infrastructure? Better hope none of your neighbors
need that bandwidth too.
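
Napkin math, with a deliberately made-up per-transaction traffic figure, shows how quickly that adds up:

    # Purely illustrative: 250k txns/s (figure from upthread), assuming ~10 KB of
    # request/response/audit/fraud traffic per transaction.
    tps, bytes_per_txn = 250_000, 10_000
    print(tps * bytes_per_txn * 8 / 1e9, "Gbit/s")   # 20.0 -- well past a 10GbE NIC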

One transaction likely involves checking account and payment method status,
writing audit logs, checking in with anti-fraud systems and a number of other
business requirements.

(I lead a payments team, not at Uber but another major tech company)

~~~
mozumder
> One box can't be distributed across multiple racks in the data center to
> guard against downtime if a switch crashes. Never mind that: one box can't be
> deployed across multiple data centers. If you deploy to multiple DCs you can
> fail over if one DC starts having issues.

Wouldn't you just have multiple NICs on one box for redundancy there? With any
backups being sent the database write-log for replication?

> In any case, payments aren't CPU-intensive, but they're a prime case of
> hurry-up-and-wait. Lots of network IO, so while you won't saturate the CPU
> with millions of transactions on the same box, I could easily imagine
> saturating a NIC.

If you're vertically scaling, wouldn't you just have the main database server
host the database files locally, on fast NVMe SSDs (or Optane) in the box
itself, instead of going over the network?

Enterprise NVMe drives can perform 500,000-2,000,000 IOPS, with about 60us
latency. And Optane is about 4x faster. Why would a database server need to
saturate network bandwidth?

Anyways, I'd love to see the actual SQL query for one of their transactions...

~~~
icebraining
_Wouldn't you just have multiple NICs on one box for redundancy there?_

What happens when the FBI raids the DC to confiscate the servers of another
person, and also takes yours?
https://blog.pinboard.in/2011/06/faq_about_the_recent_fbi_raid/

------
skepticmoron
Distributed systems is a service. Isn’t it.

------
skepticmoron
Such a rudimentary take on distributed systems. Why not just create stateless
REST services and horizontally scale to a million virtual machines? After all,
Uber is a heavily funded company.

