Hacker News new | past | comments | ask | show | jobs | submit login
Distributed architecture concepts I learned when building a large payment system (pragmaticengineer.com)
441 points by gregdoesit on April 16, 2018 | hide | past | favorite | 74 comments

This post goes into a lot of concepts but doesn't really touch on anything concrete.

Also the justification for the Actor model is flimsy compared to the other concepts.

I'd love to see an actual diagram of the architecture that was built, and maybe also why sharded + replicated postgres (+/- kafka maybe) wasn't enough to solve the data needs of this application.

Assuming they farm out the actual processing of payments (talking to credit card companies) to stripe/paypal/braintree or whatever, all this "payment system" is doing is taking API requests to make payments, calling out to whatever processor they're using, and saving the results, and probably maintaining the state machines for every transaction (making sure that invalid state transitions don't happen).... It sounds like they broke up a monolith but it's completely unclear how they made it better.

Contrast this article with the buzzword collider we had last week about how movio saved 50K$ by moving to a Go-microservice-InfiniDB-neural-kubernetes-something-something, that turned out to be a lack of basic knowledge on how to use databases.

I found this piece interesting, it introduces some basic concepts in a structured and easy to understand manner. Not very many writers take the time to develop such articles, they are rare thus useful as reference to point coworkers and for a quick recap.

The details of the precise technologies used will be irrelevant in 3 years, the fundamental concepts will not. It's not a recipe, it's a conceptual primer to build an architecture that scales.

I do remember that article -- this article is certainly better, but the actor section explanation reeks of the one you buzzword collider soup you mentioned:

> Why did the actor model matter when building a large payments system? We were building a system with many engineers, a lot of them haivng distributed experience. We decided to follow a standard distributed model over coming up with distributed concepts ourselves, potentially reinventing the wheel.

Literally, the word "distributed" is just used repeatedly to get the point across. While I might have a bit of an axe to grind with people that just throw the actor model (or any pardigm for that matter, functional programming doesn't solve every problem, sometimes you should maybe just mutate some state and call it a day) around as a way to fix things, I am well aware that it is indeed a useful paradigm, this was a bad way to explain your choice of using it. If he built the kind of system I think he did, I'm not sure that the actor model even conveys much of a benefit, and was looking to that explanation to drop some wisdom on me. It did not.

My problem is that this article is indeed too basic. The problem is that he goes to great lengths to desribe why the architecture he built has the methodologies/approaches he's laid out, but then forgets to lay out the actual architecture -- this post is all ivory tower and none of the rubber-hits-the-road that you'd expect from a good technical piece. Things fall apart when the rubber hits the road, and sharing how they fell apart enriches people who read the article.

Most intermediate engineers working with distributed systems (in good engineering orgs) have already realized/run into these concepts, for those who may not have heard of these concepts, the article is very helpful, but for those who already know them, it's basically a rehashing of level 1.

For example, take any single chapter of the Google SRE handbook (e.g. https://landing.google.com/sre/book/chapters/service-level-o...), his post reads like the first half of a chapter, not the complete thing. Of course, the SRE handbook is a high bar to set for quality, but at least a whiff of this production-grade payments system he built would have been nice.

>the buzzword collider we had last week about how movio saved 50K$ by moving to a Go-microservice-InfiniDB-neural-kubernetes-something-something, that turned out to be a lack of basic knowledge on how to use databases.

even the rudimentary experience/skills with the former gets you hot $400K+ jobs while even the advanced experience/skills with the latter gets you only $200K jobs.

And this is why the whole thing reads like an advert towards future employers to me. Thats ok, but as an engineer I want to know about options and tradeoffs.

Curiously, I had the same thought reading this. I found the bolded/italicized sentences particularly telling. I would expect this from an undergrad, but not someone who has worked on a large production system:

> Why did data durability matter when building a payments system? For many parts of the system, no data could be lost, given this being something critical, like payments.

I wish the author had shared more first-hand experience, especially those little anecdotes you can never find in textbooks...

Kyle Kingsbury/Aphyr's Jepsen distributed DB testing writeups are also a really, really wonderful base for comprehending some of the failings of distributed systems.

https://aphyr.com/tags/Jepsen https://jepsen.io/analyses

Those write ups are bar none the best writeups I've seen that focus on the distributed databases. If I had to list the things you can learn from them:

- Efficient (and intelligent) testing of distributed systems (check out his clojure libraries that he uses)

- Effective clojure

- Introduction (and re-introduction, which is IMO one of the best ways to learn) of distributed system guarantees (linearizability, serializability, and lesser guarantees), in the context of database system guarantees

- how to write good information dense blog posts

- how to deftly (and openly) navigate the ethics of paid testing of a product by a corporate entity

- how to critically think about distributed database hype trains

Also, plug for RethinkDB, the document store database that got it just about right the first time he tested (https://aphyr.com/posts/329-jepsen-rethinkdb-2-1-5 and re-configured https://aphyr.com/posts/330-jepsen-rethinkdb-2-2-3-reconfigu...)

i also like them, especially how well he documents his findings and their implications.

Before anyone dismisses one of his tested databases, I'd however like caution against trusting his work fully at this point. Some of the documents are 5+ years old, thats a lot of time for the database developers to resolve issues.

thankfully, the tests are all replicable, so if you're evaluating cassandra for example, please repeat his tests before dismissing the database.

> especially around resharding. Foursquare had a 17 hour downtime in 2010 due to hitting a sharding edge case

so I opened link


> What Happened? The problem went something like: Foursquare uses MongoDB to store user data on two EC2 nodes, each of which has 66GB of RAM

There's your problem. Using Mongo sharding in production in 2010.

> Users, written to the database, did not distribute evenly over the two shards. One shard grew to 67GB, larger than RAM, and the other to 50GB

Main pro-tip: think very hard on how you're sharding your data, based on it and how you're sharding (hash based? range based?) and how do you intend to rebalance it or be more granular should the need arise.

Nice example. I have personally seen Hadoop jobs which would only scale by using larger VM instance types.

What are Hadoop jobs? MapReduce? Someone still uses them, why?

Why not? Use comodity server to compute long running tasks dosen't seems a trend fading away anytime soon.

There are more solutions other than M/R now though (spark, presto etc) but they require high performance servers (presto suggested RAM amount is 128GB).

Can you name a single type of job that MapReduce can do better(faster/using less resources) then Spark? In my experience even for most simplest tasks like when you need to just read the data, change it slightly without shuffling and write it back Spark is faster then MapReduce with the same limitations on resources, and it's much more efficient in case of heavy jobs with joins etc. Well of course, there are other things like API which in MapReduce is just a nightmare to deal with in comparison to Spark.

FWIW, if your system scales horizontally, it also has to scale vertically. You cannot simply add nodes to a cluster into infinity. You often can't even do 5x the number of nodes before performance and availability starts to degrade. The larger the number of nodes, the more difficult to mange, and the more chances for failure.

The author writes how they didn't think a mainframe could handle the load in the future. But even super old mainframes still handle their load 40+ years on. They just had better design parameters to make sure they were used appropriately, and they knew how to keep their requirements compact. You may only have 8 alphanumeric characters to store a customer's record name in, but you're never going to have more than 2.8 trillion customers. That hard limit also means you how large the data will get, which makes capacity planning easier. (Even in the cloud you must do capacity planning, if for nothing else, cost projections)

I'm not that experienced in distributed systems but just interested in a topic. "You often can't even do 5x the number of nodes before performance and availability starts to degrade." Why? Can you give an example? So if I have 50 Casandra nodes scaling to 250 will be a problem or you mean there will be issues on a application level as a whole during that scaling?

you probably won't write your data on all 250 nodes... assuming you have a distributed database you do not write to more than 3-9 nodes. (and even 7-9 is way too much in most cases).

if you would write to all nodes your performance would be terrible. I also do not think that they needed more than one master in their payment system. I mean it's highly unlikely that a big machine couldn't handle all their transactions. if it would still be a problem I would still use a old school rds and shard the hell out of it.

however keep in mind that uber did a lot of wierd things in their past, from an engineering perspective.

Well, ideally you do actually want to spread writes across all the nodes evenly, but in practice this rarely works as expected, and you do get uneven loading across the cluster. Only one of many reasons why sharding is crap.

Re: scaling Cassandra, I haven't experienced it personally, but I've heard horror stories regarding replication and load balancing when nodes start to go down. With other systems, you see things like uneven loading, running out of storage or compute, limits on iops or network bandwidth, and literally the probability of overall failures increasing because math. It's also just harder in general to manage 250 nodes rather than 50. Running commands takes longer, you need bigger infrastructure, you start running into weird limits like what your original subnet sizes were set to, etc. There are a myriad number of variables that just become more difficult the bigger things get.

Now, compare this to three giant-ass nodes. You start getting close to capacity, and maybe you add a fourth, and that lasts you another year. So. Much. Freaking. Easier. AND more reliable. There's no contest - if you can scale mostly-vertically, do it. (Unless you're Google or Facebook and have a few billion dollars to spend on engineering, in which case this actually _is_ scaling mostly-vertically, it's just datacenters instead of nodes)

It really depends on your application.

Access services (e.g. browsing and searching in a product catalog) scale horizontally without any limitations. For this type of applications, 4 huge servers are more expensive, less reliable, and not significantly easier.

On the other hand services around scarce resources (e.g. a last-minute travel booking application) are indeed very hard to scale horizontally, so in such cases the parent's advice is sound.

Great overview with lots of links. I came across quite a few concepts I barely know - or know them in a different context. I did trip hard over this sentence:

> Distributed systems often have to store a lot more data, than a single node can do so.

Did you figure it out? If not, I assume from what I read that the author is referring to many systems keeping replicas of data on different members of the cluster. By keeping replicas, you increase the amount of data stored by can survive the failure of any given node.

Microsoft's Service Fabric is an open source distributed system build on the same principles. Check it out - https://azure.microsoft.com/en-us/services/service-fabric/

Poorly named, as it conflicts with the excellent Python based provisioning tool: http://www.fabfile.org/

"Fabric" has had a generic distributed meaning way before the Python tool: https://en.wikipedia.org/wiki/Fabric_computing

Microsoft's full product name is "Azure Service Fabric" which properly disambiguates it rather than staking a claim to a common network-related single English word as you claim the Python fabric library does.

Fair enough, TIL

Is it really open source? Do you have a link?

That's recent! Great to see it! However Windows build tools are not ready yet as per readme, also there are no releases since open sourcing yet.

Out of curiosity, have you considered using a workflow management system such as AWS Step Functions [1] instead of a set of queues? If so, what made you decide to go with queues?

Workflow engines neatly solve a lot of the challenges you mention in the article essentially for free:

* Model your problem as a directed graph of (small) dependent operations. Adding new steps does not require adding new queues, just new code.

* let the workflow engine manage the state of your transactions (the important, strongly consistent part) and just focus on implementing the logic. Scale horizontally by adding more worker nodes.

* get free error handling tools, such as automatic re-tries on failed steps, support for special error handling steps, and metrics you can alarm on (e.g. number of open workflows, number of errored workflows, etc.)

* get a free dashboard where you can look up the state of a given transaction, look at error messages, (mass) re-drive failed transactions (e.g. after an outage or after having fixed a bug), etc.

[1] https://aws.amazon.com/step-functions

Uber recently open sourced Cadence, which is similar to AWS SWF [1] and built by the same principal engineer. A good overview was presented at Data @ Scale [2] if you're interested. Under the covers, it manages a set of queues for you and lets you write procedural business logic in a resilient way as an implicit workflow.

[1] https://github.com/uber/cadence [2] https://atscaleconference.com/videos/cadence-microservice-ar...

Some parts of the Money system does use Cadence [1], Uber's internal workflow framework.

[1] https://github.com/uber/cadence

I love how they mentioned Stack Overflow successfully scaled vertically. Inspirational.

Most sites can do this - eg. I believe HN is scaled vertically. Over-engineering is rife within the industry!

Design up front for reuse is, in essence, premature optimization. - @AnimalMuppet

Pike's 2nd Rule: Measure. Don't tune for speed until you've measured, and even then don't unless one part of the code overwhelms the rest. - Rob Pike, Notes on C Programming (1989)

... quotes from http://github.com/globalcitizen/taoup

Agreed. If you understand the resource requirements of your environment then vertical scaling is totally viable. Ive seen lots of scenarios where teams shoot themselves in the foot because they fail to just think critically for a second about their deployments.

Consider a Redis-based cache: For future proofing you could introduce a sharding scheme that deterministically allocates keys based on cluster size. That would be operationally pretty complex. Not to mention also introduce complexity in edge case scenarios (e.g. your cluster size is transitioning).

Alternatively, you could do a little back of the napkin math to figure out what the size of your keyspace is going to realistically be and just...allocate a machine that fits your needs up front.

Obviously we could make arguments about budget and all that, but those are optimizations that are better to defer.

> Consider a Redis-based cache: For future proofing you could introduce a sharding scheme that deterministically allocates keys based on cluster size. That would be operationally pretty complex. Not to mention also introduce complexity in edge case scenarios (e.g. your cluster size is transitioning).

This is implemented in Twemproxy and resizing the cluster is a known edge case for which Twitter developed Nighthawk (centralized shard placement).

Yep, I much prefer the conceptual simplicity of a single deploy-able, and in my experience it can handle performance sufficiently if you actually try.

OTOH, my team owns ~10 micro-services, and some of their build times, due to integration-tests, are already painfully slow. I'd hate to have to run all of them every time we build.

HN is single core.

Even better. Another example: https://news.ycombinator.com/item?id=14443398

I've built a payment system before, I didn't make it distributed and if I was to build one again. I won't make it distributed. I like my payment system old school batch style. Correctness is very important to me and ability to rollback and replay things if things went wrong are very important.

Batch vs realtime, correctness, rollback & reply have nothing to do with whether or not you build your system distributed or not.

In fact a distributed system can have some advantages over a monolith in an application like that if done properly, for instance you could set up a central message log that keeps data for an x period allowing you to 'rollback and replay' to your hearts content.

I wish the Author had distinguished which payment system. There are Consumer payments that all Riders are familiar with and then there are Driver payments made at end of day.

Traditionally, Rider and Driver payment systems were two separate systems. This caused a lot of complications as the business became more complicated. Traditionally, Riders paid Uber and Uber paid out to Drivers. However, with Cash payments, the payment system also required Drivers to collect cash and pay Uber. Then with Eats, Restaurants were neither Driver or Rider.

The new payment system unified all of these into pure money movement where any entity can move money to any other entity. This enables all current and future products to have flexibility in payment models. (Think about how payments would be modeled for Uber Freight or with the new Uber Transit partnership [1])

Source: I work on the Uber Payment System

[1] http://www.businessinsider.com/how-uber-masabi-mobile-train-...

Is there anything special about it, or is it just a regular ledger?

At the end of the day, much of all payment systems is just a ledger. Payment systems have to be provably correct, reliable, and available. Many payment systems make tradeoffs between all of these. Much of existing payment systems are all batch based.

What we have built at Uber is a realtime system that is reliable, provably correct with high availability.

As I've said in another comment, for many other systems, if it fails, people can just try again later. If we fail, someone's mother, grandfather, son or daughter are stuck on the side of the road in the middle of the night somewhere. Our needs (or at least the standard that we put on ourselves) is higher than many other payment systems.

In addition to knowing which payment system, I'd love to know if Uber actually redid the work of credit card transactions from relative scratch, or if they just used something like Stripe or Braintree.

Uber uses Adyen.

Thanks for the pointer -- I've actually never heard of Adyen before, but the list of companies that trust them is pretty impressive.

I'm relaly impressed with the diversity of their offering as well:


I'm currently in Japan and it's actually relatively common to pay for things at the convenience store here, I'm amazed they support it.

I worked on payments at SoundCloud where we used Adyen as well. I didn't get much exposure but superficially it wasn't ideal. Integrations were wonky for each country and we had to glue it all together.

IIRC Braintree, which routes some transactions to Adyen (yes, it’s possible)

Unless I'm imagining it, it's in the first paragraph.

"I joined Uber two years ago as a mobile software engineer with some backend experience. I ended up building the payments functionality in the app - and rewriting the app on the way".

Following the "rewriting the app" link in that paragraph takes you to another page "Engineering the Architecture Behind Uber’s New Rider App" - although that's co-written by two people, neither of which are the author of this article it would seem.

This article is golden, and not only from few aspects, especially handy if you are learning distributed systems in the current semester, and the subject is not properly covered.

Thank you!

cockroachdb is a distributed SQL database system that solves all the problems mentioned in this article.

I'll never understand why a payment system needs to scale horizontally.

Modern computer systems can scale to 500 or more processor cores. Each core runs billions of instructions per second.

A system for a billion accounts on the scale of Uber probably has a million active users at a time, probably a quarter that involving payment.

Is Uber saying that they can't support 250k payment transactions per second on the largest system today? That's maybe 1000 transactions per second per core on the largest systems, or about 1ms per transaction. Why is that impossible for them?

Or, put it another way, why can't one transaction be completed in less than 1 million CPU instructions?

And that's for the very largest company like Uber.. can't even imagine a typical startup needing to scale horizontally for payment processing.

All comes down to uptime.

One box can’t be distributed across multiple racks in the data center to guard against downtime if a switch crashes. Never mind that—one box can’t be deployed across multiple data centers. If you deploy to multiple DCs you can fail over if one DC starts having issues.

Then there’s deploys. Do you canary your deploys? Deploy the next release to a subset of production nodes, watch for regressions and let it ramp up from there? Okay, I’ll give you that one, it could be done on one big box.

In any case, payments aren’t CPU intensive but it’s a prime case of hurry-up-and-wait. Lots of network IO, so while you won’t saturate the CPU with millions of transactions on the same box, I could easily imagine saturating a NIC. Deploying to shared infrastructure? Better hope none of your neighbors need that bandwidth too.

One transaction likely involves checking account and payment method status, writing audit logs, checking in with anti-fraud systems and a number of other business requirements.

(I lead a payments team, not at Uber but another major tech company)

> One box can’t be distributed across multiple racks in the data center to guard against downtime if a switch crashes. Never mind that—one box can’t be deployed across multiple data centers. If you deploy to multiple DCs you can fail over if one DC starts having issues.

Wouldn't you just have multiple NICs on one box for redundancy there? With any backups being sent a database write-log for replication?

> n any case, payments aren’t CPU intensive but it’s a prime case of hurry-up-and-wait. Lots of network IO, so while you won’t saturate the CPU with millions of transactions on the same box, I could easily imagine saturating a NIC.

If you're vertically scaling, wouldn't you just have the main database server host the database files locally, using fast NVMe SSDs (or Optame), in the box itself, instead of going over the network?

Enterprise NVMe drives can perform 500,000-2,000,000 IOPs, with about 60us latency. And Optane is about 4x faster. Why would a database server need to saturate network bandwidth?

Anyways, I'd love to see the actual SQL query for one of their transactions...

Wouldn't you just have multiple NICs on one box for redundancy there?

What happens when the FBI raids the DC to confiscate the servers of another person, and also takes yours? https://blog.pinboard.in/2011/06/faq_about_the_recent_fbi_ra...

I'm largely referring to RPC calls, not DB queries. Many of those calls won't even be to services you control and may well be HTTP calls to other companies.

All comes down to uptime.

20 years ago we had 1000+ days uptime on DEC kit. No one was even impressed by 500 days. Nowadays people build all sorts of elaborate contraptions to do what used to be entirely ordinary

By uptime people usually mean availability to the end users, not a literal uptime. Which also includes availability of an entire datacenter infrastructure, connectivity, internet infrastructure, making it pretty much impossible to have high availability in a singe datacenter.

Heh, I guess. In my scenario the users actually got that uptime too, ‘cos they were connected over LAT...

Doesn't do much good if you have to fail out of an entire data center.

You can with VMScluster. There are multi-site clusters with 15+ years uptime.

Even at 250k transactions per second that's over 20 billion transactions per day which seems unlikely. 1k transactions per second (100 million per day) is probably a closer ballpark figure for a company like Uber, given they only have around 3 million drivers worldwide. The problem definitely would scale vertically however it does depends on how they interact with the API of their payment processor.

Maybe I am underestimating Uber, but when I think of enormous scale, in terms of number of transaction per day, I think Amazon, or Mastercard.

Uber? The upper bound of transactions is restricted to the total number of cab rides a day. That number isnt in the billions is it?

> why can't one transaction be completed in less than 1 million CPU instructions

Transactions in this system are almost certainly network bound. The relevant CPU overhead is likely trivial—-likely comparison and not arithmetic in nature even. In that context “add more NICs” is effectively an exercise in horizontal scaling. On top of that any network operation has consistency concerns to contend with.

You could contrive a system that is block device I/O bound but it’s likely to have significant network overhead as most block devices are network attached these days anyway!

You're being downvoted - maybe because of your tone - but I partially share your sentiment. A good counterexample to "overdistribution", not without its own set of problems, is the LMAX architecture.

See: https://martinfowler.com/articles/lmax.html

Because when your task is embarrassingly parallelizable, as payments are, you owe it to yourself to go ahead and take advantage of that to endure your part of the system isn't the bottleneck.

You're being downvoted because the article actually explains this quite well.

Maybe they use javascript instead of using a real language?

It's mostly Go/Java.

Distributed systems is a service. Isn’t it.

Such a rudimentary take on distributed systems. Why not just create stateless rest services and horizontally scale to a million virtual machines. After all, Uber is a heavily funded company.

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact