How We Manage a Million Push Notifications an Hour (gojekengineering.com)
140 points by shadykiller 53 days ago | 88 comments

We had a similar challenge sending this many notifications at OneSignal, which we solved using Rust.

We recently hit a peak of 850 Million notifications per second, and 5 billion notifications per day. Here's a blog post on how we do it. Written back when we were at "only" 2 billion notifications per week: https://onesignal.com/blog/rust-at-onesignal/

ROFL: I was about to post a message here bragging about how I achieved 2 million pushes a minute with a simple Kafka queue + a cluster of push servers written in Go, but now I feel humbled!

I really like the diagram in this blog post. Is that diagram custom made or did you use software to create it?

No offence but is this correct?

> We recently hit a peak of 850 Million notifications per second

per second?

From your blog "OnePush is fast - We've observed sustained deliveries up to 125,000/second and spikes up to 175,000/second."

I think you may have a typo. The bandwidth would be incredible too: even at an unlikely 10 bytes per delivery, that would be 8.5 GB/sec.

This post is now a couple years old, so we've grown a lot since then. Here's a newer (more marketing-centric) post where we announced our 850k/second milestone: https://onesignal.com/blog/throughput-record/

It's generally under 4 bytes per delivery, depending on the content, and we have several delivery servers. APNS, for example, doesn't support payloads larger than 4 bytes.

I think you mean 4kb rather than 4 bytes

Oops, you're right. There is a typo. It should say "850,000/second"

The one-day record could have been serviced in 5.8 seconds of peak per-second throughput, is that right?

It's not a perfect comparison because we benefit slightly by batching messages with the same content when sending to Android devices, and we have some backend features in place to reduce storage usage when the content is the same across many messages.

iOS, however, is still quite challenging at scale since there's no batching mechanism for APNS, so even bandwidth becomes a bottleneck. We do web push too, which requires a lot of CPU cycles to encrypt each payload for its recipient.
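For the Android case, the batching idea boils down to grouping recipients by payload. A minimal sketch of that grouping (not OneSignal's actual code, and ignoring FCM's real batch-size limits):

```go
package main

import "fmt"

// batchByContent groups device tokens by message body so that each
// distinct payload is delivered once to a batch of recipients.
// (FCM caps real batch sizes, which this sketch ignores.)
func batchByContent(msgs map[string]string) map[string][]string {
	batches := make(map[string][]string)
	for token, body := range msgs {
		batches[body] = append(batches[body], token)
	}
	return batches
}

func main() {
	msgs := map[string]string{ // token -> payload
		"tok1": "50% off today",
		"tok2": "50% off today",
		"tok3": "your driver has arrived",
	}
	fmt.Println("distinct payloads:", len(batchByContent(msgs)))
}
```

For promotional blasts, where millions of devices get the same body, this collapses the send count dramatically; for APNS, as noted, no such collapse is possible.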

How does sharding/partitioning work in these cases? How do you partition persistent connections going to one URL/load balancer on the backend?

1 million per hour is only ~300 per second. On a 1.5 GHz 4-core Raspberry Pi, that gives you 21 million clock cycles to deal with each message.

The architecture seems rather overengineered considering a single Raspberry Pi could do the job, even after 100x scaling!
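The back-of-the-envelope math above checks out, assuming all four cores are fully available:

```go
package main

import "fmt"

// cyclesPerMessage returns the per-message clock-cycle budget for a
// machine with `cores` cores at `hz` Hz, handling `perHour` messages
// per hour.
func cyclesPerMessage(cores int, hz, perHour float64) float64 {
	perSecond := perHour / 3600 // 1M/hour is ~278 msg/s
	return float64(cores) * hz / perSecond
}

func main() {
	// The Raspberry Pi figures from the comment above.
	budget := cyclesPerMessage(4, 1.5e9, 1e6)
	fmt.Printf("%.1f million cycles per message\n", budget/1e6) // 21.6 million
}
```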

I wrote an algorithmic trading framework to work on Warsaw Stock Exchange.

WSE back then had a limit of 10k messages per second (as in one message every 1/10,000th of a second). Messages arrived via two separate network operators, so that was 20k messages to be deduplicated down to 10k operations. Incoming messages were compressed, so they required decompressing.

Responding to a message required complex processing and may then have resulted in an order to the market which, again, had to be constructed and validated.

All this worked on a single server (regular two-socket Xeon-based server).

There was a 10us (microseconds!) time budget to send a response to the market, and it had to work every time, even under the maximum possible load (10k messages per second).

Routing and forwarding 300 messages per second doesn't seem like something to brag about...

Hi from Warsaw. Care to share more about the architecture, the technologies used, the numbers? Thanks in advance.

Yep, I hit the calculator first thing too, but I think your criticism is not quite fair; after all, the data has to be pulled off disk and acknowledgements written back.

That said, it really doesn't sound too difficult with straightforward architecting (says the armchair critic).

I think this is a disingenuous or naive view. While you may not have been serious, it's worth unpacking why it's naive, as I think a lot of engineers hold similar opinions.

To start, there's nothing here about what machine this architecture runs on, it could be running on a Raspberry Pi for all we know.

Then, this ignores the cost of database lookups for the keys. That data is probably small enough to fit on the one machine, but then you need support for (reliably, in real time) syncing that data to the service. A separate database is therefore probably the right solution here, which means you're doing a network round trip for each message sent, which makes it unlikely that a Pi could do this.

Next up you've got the issue of reliability. The message-queue separation gives you better reliability in the face of issues such as upstream APIs going down or erroring, or issues for a specific user. All the business logic around handling this, plus the message queue handling persistence and ACID semantics (or parts thereof), takes additional resources, not to mention potentially a fair bit of disk space (for a Pi) to queue up undelivered messages should an upstream API slow down or stop accepting new messages.

Then you have hardware failure, at this scale you don't want a single machine failure to wipe out your primary communication method with millions of customers. You'd therefore want to have a distributed system, even if that's only for reliability rather than performance.

Lastly, 21 million clock cycles might sound like a lot, and might go a long way with C/C++/Rust, but as you move up to more dynamic languages that will reduce significantly. It happens that they are using Go here, and that's likely to get pretty good performance out of the hardware, but writing this service in Python/Ruby would be a very valid choice for developer productivity, or based on existing skills they have in the team. That might be 1/10th the performance, but since you need a distributed system for reliability here anyway, adding a few more machines to the pool might be a better choice than introducing a lower level language that takes longer to develop and exposes you to memory safety or threading issues.

There may well be other factors I haven't considered here, but I think for the use case of delivering that scale of messages, the reliability options you get with a system like this are well worth the additional hardware requirements and architecture overhead.

Edit: lmilcin makes a good point about trading systems, but there are several differences – that system still has a single point of failure, it was probably written in a low level language with input from experts on performance, and it was running on a much faster machine. The single point of failure of a server-grade machine like that is probably an acceptable risk if you own the hardware, but in a cloud environment (which brings other benefits) hardware is less reliable so probably not an acceptable risk there. I don't think it's an apples-to-apples comparison, although it is interesting.

I partly agree with you, but you're probably over-engineering it. They talk about "GoLife booking notifications" and "promotional notifications". Neither of these requires (edit: I mean sounds like it requires) exactly-once semantics. If you're OK with very rarely failing over, then restarting from scratch on the failover machine is fine. Even if not, pushing batched IDs ("I've done IDs x, y, z") over the network to the failover machine would remove most of that. I don't think it needs ACID semantics.

> Then you have hardware failure, at this scale

It's really, really, not scale!

> Lastly, 21 million clock cycles might sound like a lot, and might go a long way with C/C++/Rust, but as you move up to more dynamic languages that will reduce significantly

Then you're prob using the wrong lang.

I don't really accept that level of engineering is necessary all round, unless the business case requires it, and then I'd speak to whoever put those business requirements together and ask hard questions.

> It's really, really, not scale!

Talking to 1 million different people an hour is _business_ scale. Regardless of the tech required to do that, it not working would likely be a significant business impact.

> Then you're prob using the wrong lang.

There's so much more here than performance: developer productivity, tooling availability, all sorts. They happen to be using Go, which probably offers the best trade-offs for this particular system, but if you were doing complex machine learning you'd probably want to use Python because of all the excellent tooling available, and that means having a "slow" language for the parts that aren't optimised for you.

If the difference is 1 engineer, ~1k lines of code, 3 machines, vs 3 engineers, ~10k lines of code and 1 machine, the former is likely to be the right trade off for most companies. It's cheaper to build, and since number of bugs typically correlates to lines of code, it will likely be much more reliable.

As for the ACID semantics, you're right that you wouldn't need them all here. Redelivery is probably fine within some bounds, and the upstream APIs might even have idempotency tokens to prevent it. But losing a machine for a few hours, plus the time to regenerate all the messages that were lost on it, could add up to quite a bit of downtime and a poor experience for users.

You're not wrong from a performance perspective, but taking into account reliability, business impact, user experience, and developer productivity/costs, I think the solution in the blog post is a better set of trade-offs than you're suggesting.

And what happens if the Raspberry Pi goes down for an hour?

You spend $35 on a spare.

And what if you don't want to lose the notifications sent during the time it was down?

Odroids have better price to performance in my opinion.

You buy ten more beforehand and keep them all running side by side, ready to take over the workload on any outage.

Replace raspberry pi with any local server configuration.

Well now you have an architecture.

My thoughts exactly.

One million per hour? I forgot how to count that low.

1 million events an hour is just 278 a second, which is a much more surmountable number than 1 million.

Same reaction. You could handle this kind of traffic with, just as examples, procmail or inetd on a single PC.

Hell, modern microcontrollers (which run into the several hundred megahertz) could handle that load.

What if you get one push event on the first second, and then 999,999 on the last second? Still one million per hour. Don’t assume the events are evenly distributed. I see this a lot on HN. Yes, 1,000,000 divided by 3600 is indeed about 278, but why would we assume it was 278 notifications exactly every second?
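To make the distinction concrete: a system provisioned for only the hourly average would need the better part of an hour to drain that worst-case last-second burst. A rough sketch, using the ~278/s average from the comments above:

```go
package main

import "fmt"

// drainSeconds returns how long a backlog takes to clear at a fixed
// service rate, ignoring new arrivals.
func drainSeconds(backlog, perSecond float64) float64 {
	return backlog / perSecond
}

func main() {
	// Worst case sketched above: 999,999 events land in one second,
	// then get serviced at the hourly average of 278/s.
	fmt.Printf("~%.0f seconds to drain the burst\n", drainSeconds(999999, 278))
}
```

So whether "a million an hour" is easy depends heavily on the arrival distribution and the acceptable delivery latency, not just the average rate.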

Then they should have said they can handle a million push notifications a second.

That doesn’t make sense?

I think very few people, even on Hacker News, are so pedantic to argue strongly that someone is lying when they call 999,999 notifications in one second "a million".

Queue it. It's not particularly urgent that push notifications go out instantaneously. A 99th percentile latency of 5 minutes is fine.

Not if it’s your ride waiting for you in short-term parking space, or a food delivery person at your doorstep.

> or a food delivery person at your doorstep.

That's what doorbells are for.

Haha I look forward to an IoT doorbell company blogging about how they send 2 million door bell rings per second

Nothing says you must handle them all at once, though. You could queue and handle them as your system allows... still reaching 1 million per hour of throughput.

Sure, but just because the average is 278 per second doesn't mean the events arrive at that rate. Bursts, which are much more typical, are way harder to process. But someone writes an article like "How We Handle X Things Every Y", and inevitably someone does basic division and surmises that it shouldn't be so hard.

> Bursts are way harder to process

They aren't though. There are very few use cases that aren't bursty, and a plethora of reliable queueing systems exist.

Queues eliminate bursts; bursts are not harder to process unless the response must be synchronous.
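The point about queues absorbing bursts can be shown in a few lines. A minimal sketch using a buffered Go channel as the queue (a stand-in for RabbitMQ/Kafka, not the article's actual code):

```go
package main

import (
	"fmt"
	"time"
)

// drain consumes queued items at the worker's own steady pace,
// regardless of how bursty the arrivals were.
func drain(queue <-chan int) int {
	n := 0
	for range queue {
		n++
		time.Sleep(10 * time.Microsecond) // stand-in for real delivery work
	}
	return n
}

func main() {
	// The buffered channel is the queue: a burst of 1000 sends is
	// absorbed immediately...
	queue := make(chan int, 1000)
	for i := 0; i < 1000; i++ {
		queue <- i
	}
	close(queue)

	// ...and the worker drains it at a fixed rate.
	fmt.Println("processed:", drain(queue))
}
```

The producer's burst completes instantly; only the queue depth (and the acceptable latency) has to be sized for the burst, not the worker's throughput.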

I used to have pretty much the same reaction as most of the comments here on Gojek engineering posts. Coming from a telco background, the RPS figures seemed quite low while the complexity of the architecture seemed completely over the top.

However, I now know more about the company. They have one of the leanest engineering teams of comparable companies here. Their CTO and VP Eng are incredibly practical guys whose advice would resonate with pretty much everyone here (e.g. the interview here[0], where they repeat the dangers of scaling too fast). They have small multifunctional product teams, and the microservices architecture fits this (as opposed to being built for fanciness / resume's sake). And they did it: there are dozens of "features" in the app which are entire giant businesses in their own right.

So I'd imagine the shortest path to getting this up and running was definitely considered. It would have been nice to go into more detail as to why simpler / outsourced solutions wouldn't work, but I'd guess there's a good reason.

[0] https://www.youtube.com/watch?v=He0XBBfCEVk

> but I'd guess there's a good reason.

Sometimes there are no good reasons, and we should be able to question them; that's how progress is made, in science and in engineering.

I'm currently implementing a notification system for our network of sites. Aside from the HTTP microservice, my implementation is about the same. Which isn't much of a surprise - I expect this is a fairly common way of doing it.

With that said, I was surprised to learn how complicated it can be to send notifications cross-platform. Fortunately we're targeting the web, so I really only have to worry about two push provider implementations: APNs and VAPID. I really hope Apple agrees to implement VAPID sooner rather than later so developers can stop wasting their time.

We have used OneSignal in the past, but there's something so satisfying about delivering the notification yourself. Also, a word of caution for people new to push notifications: we discovered early on that sending a million push notifications all at once is a really good way of crashing your site, as they have a surprisingly high click-through rate in the first minute after sending!

Why not use Amazon SNS? They abstract away the differences between APNS and whatever Google are calling their push infrastructure these days.

I don’t mean to discount the work done here, but why would a dev team build this themselves these days? There are many practically infinitely scalable services that do just this with various levels of sophistication.

From the simple AWS SNS ($0.50/million messages) or Google Firebase (free) solutions that do multi-device push messaging whilst managing keys, redelivery and delivery responses to the more managed services like Urban Airship and OneSignal.

The actual Android delivery component has to end up going through Firebase anyway.

The only part of this solution that SNS and Firebase don't provide is the user fan-out functionality.

Surely there are other, more important features delivering business value to be built rather than re-inventing (and then scaling and supporting) the delivery of mobile notifications.

I cofounded OneSignal for this reason. We were previously a game studio that had built an in-house solution. It was a huge maintenance headache and a distraction. Eventually, we realized how many companies were building one-off push implementations and decided to focus our efforts on building a really great third-party solution.

Firebase and SNS have since entered this market as well. I'm obviously biased, but I honestly can't recommend them. They work for some of the basics, but there's been very little innovation in either product, and occasional service issues go unsolved for a long time. They also don't make any money from these services, which should be a concern for anyone who plans to rely on them for such a core part of their business.

Pretty soon your app is just a bunch of proprietary cloud provider products glued together and now there’s vendor lock-in and much less flexibility.

You might want to throttle messages based on user settings or say ML models to avoid too many annoying messages. For critical stuff you might want to switch over to email or SMS.

These are startups that have the funding to do this kind of thing. This is a core part of the stack for any such company.

If that's your concern, lean on Firebase.

You have to use Firebase to deliver messages to Android anyway, and having it deliver to APNS as well comes for free with that implementation.

Maybe you do want to do those things, but this implementation doesn't, nor do they mention the intention to do that. But if you did, SNS also provides implementations for delivering SMS and Email (preferably through SES).

Startups shouldn't waste their money on this type of thing, especially if you consider most startups will fail. Vendor lock-in is a pretty stupid thing to worry about for an early-stage start-up (and I argue it's stupid to worry about long-term too if that vendor is AWS, Google or Azure who are all very competitive with each other).

Having middleware to abstract Firebase's API from the services makes sense, if only to minimize the disruption the next time Google changes it. I've always left tokens for the callers to sort out before calling the notification service, but I see the appeal of centralizing it if you have a lot of small products. I agree there is no longer any good reason to target APNS directly.
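The abstraction being described is just a small interface seam. A sketch in Go, where every name is hypothetical (this is neither Gojek's nor Firebase's actual API):

```go
package main

import "fmt"

// Notification is the provider-agnostic payload our services speak.
type Notification struct {
	DeviceToken string
	Title, Body string
}

// Pusher is the single seam between callers and any provider.
type Pusher interface {
	Push(n Notification) error
}

// fcmPusher would wrap the real Firebase client; the next time Google
// changes the API, only this type has to change.
type fcmPusher struct{}

func (fcmPusher) Push(n Notification) error {
	fmt.Printf("would deliver %q to %s via FCM\n", n.Title, n.DeviceToken)
	return nil
}

func main() {
	var p Pusher = fcmPusher{}
	_ = p.Push(Notification{DeviceToken: "abc123", Title: "Your driver has arrived"})
}
```

Callers depend only on `Pusher`, so a GCM-to-FCM style migration (or a swap to APNS, SNS, or a test fake) happens behind one type.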

Nothing in the article really said how they got to millions an hour. In my experience the notifications are cheap enough to send that that figure is easy to obtain without any special effort on the push side. It's also embarrassingly parallel/easy to scale. How everything else scales around that (recipient filtering, reporting) is the interesting part.

I'm not disagreeing; I just want to point out that this company has raised 3 billion dollars in total. So IMO it's worth it for them: they have the money. I wouldn't do all of this starting out.

Gojek is way past the “most startups will fail” stage though.

Efficiently handling millions of events per second (not per hour) using whatever cloud PaaS is a no-brainer these days. The lock-in is worth it when you can reliably move that fast.

If you're really worried about lock-in, most event based PaaS support running containers. Moving between clouds would then just be the glue-work around your containers.

Honestly, how many companies DON'T rely on some proprietary cloud provider at some point?

* SMS - You can't just interface directly with some cell company (which is proprietary anyways).
* EMAIL - Have you tried to send your own emails in the past decade? Just use a provider.
* Push Notifications - Have to use Firebase for Android anyways.


Your core stuff should limit proprietary stuff. Edges are going to change over time no matter what, might as well use stuff that makes you faster now.

> * SMS - You can't just interface directly with some cell company (which is proprietary anyways).

Wut? People don't even know anymore that you can use cellphones to send SMS?! (And no, that's not the only option, SS7 works just fine as well.)

> * EMAIL - Have you tried to send your own emails in the past decade? Just use a provider.

Have you? Talking nonsense does not make expertise ...

Email in particular. Sending your own isn't the problem. Managing your reputation, deliverability, etc just isn't fun to deal with.

I don't think vendor lock-in is what small startup should worry about.

They are not small though. Gojek is big in Indonesia specifically

Gojek is a super app in Indonesia serving a population of 264 million. They can afford this level of engineering and are required to do it, to stay competitive. They have millions of motorcycle drivers (called gojeks) for whom they are trying to find jobs while the drivers idle waiting for passengers.

If Uber built this, no one would question why they didn't use some off-the-shelf solution. Gojek is a Google Cloud customer under their startup program, Surge. So I am sure they are aware of the solutions proposed.

They don't call themselves a technology company but a company that helps gojek drivers find jobs and earn extra. In the process they became a food delivery, transportation, payments, and courier company. In spite of this they still focus on their primary objective: finding jobs for gojek drivers so they can earn extra for a better living.

Before the Gojek app, most of the drivers were just idling waiting for passengers. Now, with a $25 KaiOS feature phone [1], they can just receive a message and earn instead of idling; they don't have to use an iPhone or Android or some heavy OS. Indeed, based on a social survey, some of the drivers have increased their earnings 3-4 times depending on how willing they are to take on additional jobs.

I believe the beauty of Gojek lies in its model: focusing on finding jobs and developing solutions to help those motorcycle drivers using technology (although they have since extended it to cars). I am intrigued by their idea since it's not a copy of Uber but a solution to a problem local to Indonesia, one which could extend to any country in the world: finding jobs to do while idling, and earning more.

[1] https://en.wikipedia.org/wiki/KaiOS

I don't think the other answers here question the impact Gojek has on millions of people.

> If Uber built this, no one would question why they didn't use some off-the-shelf solution.

If it's over-engineered, they probably would.

Disclaimer: I work at Gojek.

> They can afford this level of engineering and are required to do it, to stay competitive

> If Uber built this, no one would question why they didn't use some off-the-shelf solution.

Uber and Gojek don't have the same talent pool, so we can't compare them apples-to-apples.

It is a risky move for Gojek and not for Uber because of the talent pool in their respective area.

Why would you think Gojek does not have the talent and cannot manage the engineering?

If you have used their app and reviewed their platform in comparison with Uber's, can you please share?

Indeed, Uber did try to enter Indonesia and failed. They were pushed out of most of South East and East Asia because they didn't have the engineering talent to build systems for those specific countries; local companies like Grab in Singapore, Gojek in Indonesia, and Didi in China beat them. So why would you think those companies don't have the talent to build systems better suited to their own environments than Uber?

You guys are missing the bigger picture. Uber didn't fail in Southeast Asia. The competition was so steep that Uber and Grab were both losing a lot of money through aggressive marketing. I know how aggressive they were, as I'm from Malaysia. So Uber and Grab came to an agreement where Uber sold off their business to Grab for a 27.5% stake and a seat on the board. It had nothing to do with talent. It's pure business: instead of a lose-lose situation, they created a win-win for both.

> Indeed, Uber did try to enter Indonesia and failed. They were pushed out of most of South East and East Asia because they didn't have the engineering talent to build systems

Because they don't understand the local market, period. Nothing more than that. It has zero relationship with Uber not being able to hire engineering talent. They simply don't know the market.

If you put the smartest engineers in the room but they don't know the business domain, they will fail from day one. I think people put engineering above everything else when, in the majority of situations, business-domain knowledge trumps engineering excellence.

That is one reason. The other reason is that locals will choose local (see China, with its tons of China-only services). The Indonesian president, Mr. Joko Widodo, will surely promote Gojek over Uber.

> So why would you think those companies don't have the talent to build systems better suited to their own environments

You have to understand the context here.

Why do you think Gojek opened a huge "lab" in India?

Why do you think Grab opened a "lab" in Seattle and hired Steve Yegge?

You may not have realized this, but they are not the only South East Asian "Unicorns"/"Decacorns" to have opened R&D labs outside their home base.

I can't share many of the details of why these companies (_and_ a few others) decided to do so, but feel free to guess a bit here and there.

> Because they don't understand the local market, period.

Uber couldn't iterate their platform and technology according to local market requirements and conditions; engineering was one of the reasons for the failure. They then tried with money, and that didn't work either. Uber failed due to a combination of business and technology-platform issues.

If you prefer to live in a bubble where there is no better engineering talent anywhere in the world than at Uber, that's fine. But the ground truth is already there in the markets from which they retreated.

Gojek opened centers in India because it's cheaper than Singapore.

Grab's costs in Singapore and Malaysia are higher too, compared to India.

So, like any other US firm that has opened a centre in India to save costs, they did too.

Eh... Grab opened an R&D lab in Seattle and hired Google folks. That's breaking the wallet big time in terms of employee compensation.

Grab isn't the only one from SE Asia that has opened a lab in North America. There are at least two more, from Indonesia, that opted for North America instead of Singapore/Malaysia/India.

Can't share details other than "talent pools just not there".

Feel free to disagree with the execs.

> Grab opened an R&D lab in Seattle and hired Google folks. That's breaking the wallet big time in terms of employee compensation.

Sure, they would open in Seattle, as it's 30% or more cheaper than Singapore. Grab's largest R&D team is in Singapore; they did bring in talent from around the world there. But as the cost of living in Singapore is very high, an expat is very expensive. [1]

So it does make sense to open an R&D center in a cheaper location; it also shows they have engineering talent from around the globe.

[1] https://www.businessinsider.sg/most-expensive-cities-in-the-...

Cheaper in Seattle? I don't have exact numbers on how expensive it is to hire folks in Singapore, but I can tell you this: those Google folks cost at least $150k USD base, without bonus and RSUs. Someone like Steve Yegge will have had total compensation north of $500k USD at Google, so Grab should be paying him at least $700k-1M USD.

Google fresh-grad total compensation should be between $150k and $300k USD in Mountain View. Fresh grad! Seattle usually costs 10-15% less in total compensation compared to Mountain View.

Heck, even a recent Microsoft offer for a fresh grad: $110k base, 10-20% bonus, and RSUs. That's MSFT, not FAANG. FAANG will give you a sign-on bonus on top of that, ranging from $50k-100k.

Keep in mind Grab RSUs are paper money at the moment.

I highly doubt Singapore engineers, given the same level, make that much. Probably only the top 0.5%.

I don't see a good discussion happening here, as you still think these folks came to North America and opened R&D labs because it's cheaper than Singapore. That logic doesn't make sense out of the gate: India is way cheaper, so why not grow there and not bother opening a lab in Seattle at all? You do know that it is damn hard to grow an R&D lab in the USA due to regulations and immigration policies, right?

To avoid repeating the facts on the ground, I would suggest you reach out to the executives at those companies and ask why they did so, as I have done, to avoid confusion, assumptions, and guessing games.

One big factor is being free from bundling proprietary, vendor-specific libraries that may also leak users' information or otherwise increase the security-risk footprint of your app. Having control over the services that make up the core of your business, as opposed to outsourcing and relying on third parties, can be very freeing and also cost-effective in the long run.

Once you're hosting on AWS you've pretty much already crossed that particular bridge.

Engineering Pride (and exercise for future employment).

This might sound condescending, but hear me out:

- if you keep doing the same thing like everybody else out there, gluing APIs from cloud-vendor, you won't stand out.

- if you're a company of _that_ size (Uber, Gojek), you gotta do something big because the name/brand is associated with "Engineering" excellence

- if you're a leader (or in management), would you tell your employee to "suck it up" and do the boring stuff like everybody else? or would you empower them to do more? hype them up? this is a strategy for keeping them happy too.

I don't begrudge that. That's how companies evolve: they start small and scrappy, they grow very fast, and they're at the level where "doing simple stuff" won't get you promotion or you won't have something cool to show during the interview for your next job.

It's work dynamic.

Did you read the post? They're using Firebase messaging, but that only covers a part of their workflow.

They’re only using Firebase for Android delivery (which is mandatory anyway). But it is just as happy to deliver to APNS at $0 as well with near infinite scalability.

The actual delivery component of their messaging system is unnecessary.

> Surely there are other, more important features delivering business value to be built rather than re-inventing (and then scaling and supporting) the delivery of mobile notifications.

The business value the service provides is that it allows the engineering organization to scale. Gojek has a couple hundred engineers, split across multiple teams, where each team owns multiple services/products/applications, each with its own development cycle. Having a single abstraction provided as a service helps mitigate the risks involved in changes to implementation details and external dependencies (e.g. the GCM deprecation), since it abstracts them away completely.

It's not hard to do what they did, so I'm not surprised they did it themselves. It's just simple async queue processing, and doesn't really need a lot of sophistication, especially at just 1M notifications per hour. Now if they need to process a few orders of magnitude more, then AWS SNS would probably be a good next step.

I had a blast using Go-Jek and Grab when I visited Indonesia. It’s basically Uber/Lyft, except you sit on the backseat of a scooter rather than a car.


Of course as others mentioned it’s a super app so they do many other things, but I believe this is one of their main businesses.

Go-Jek has been huge for us as holiday rental managers in Bali. But lately there seems to be some sort of racket by drivers:

Example for Go-send (sending a package): "I have motorbike problem please can you cancel".

A minute after cancelling I get a message on WhatsApp: "Hey fixed my motorbike, still need?". Same thing with food orders.

Example for the Go-food: "Hey I'm at the restaurant, the phone is down. Can you cancel order? I will do manually".

Obviously the restaurant and the driver get the full customer payment this way, and Go-Jek gets zero. This depends on the customer cancelling, as otherwise the driver's rating takes a hit.

Wow that's really clever!

The average monthly full-time salary in Bali for a local is around $270 USD. A Go-Food order could easily be $100. Last time I checked, Go-Jek takes 20% of the order value; on a $100 order, dodging that could get them $10 each (assuming the driver and restaurant split the $20 evenly). One of these a day for a month and you've made an additional average monthly salary. Think about that in terms of a Western IT annual income (say $60k): that's like an extra $5k a month.

The title of the blog post is actually click-bait.

The point of the article was more that when you have multiple teams, each owning multiple products, having a single well-defined abstraction over external dependencies, provided as a service that you control, is important to manage the risks posed by those external dependencies and to make overall maintenance easier.

This is cool, but there's this thing called the MQTT protocol, and Emitter.io is one implementation of it. You can build your own notification system on top of it at a much lower cost.

Why is less than 300 tx/sec such a big deal?

I appreciate the positive effort to communicate something useful, but this basically came down to "We use RabbitMQ".

I've seen a few good example blog posts discussing this scale, the 10 to 500 transactions per second and learned a lot from all of them.

Does anyone know of any description of architectures scaling to the next magnitude? 1000 to 10 000 TPS ? And even higher to say 100 000 to 1 000 000 TPS ?

It's difficult to "guess" ahead of the fact how you are going to scale for the next 10x, because you need to know:

a.) Which component is going to start breaking, and that depends on the usage pattern.

b.) Which business/tech compromises are ok to make, and that depends on a.)

Generally speaking though, you'd try to benchmark the system to find the bottleneck component, and based on the nature of it, either try to throw more hardware (horizontal or vertical scaling) or optimize the software.

We handle peaks of 800k tps in a few systems, for an analytical platform. Partition in Kafka by some evenly distributed key, create simple apps that each read from a partition, process it, and commit the offset. Avoid communication between processes/threads. Repartition using Kafka only. For some cases we had to implement sampling, where the use case required highly skewed partitions.
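The per-partition shape described here can be sketched with channels standing in for Kafka partitions (no real Kafka client is involved; `consume` and `commit` are hypothetical names illustrating the pattern):

```go
package main

import (
	"fmt"
	"sync"
)

// consume drains one partition with no communication to other workers,
// committing the offset only after each event is processed.
func consume(partition <-chan string, commit func(offset int)) int {
	processed := 0
	for event := range partition {
		_ = event // process the event here
		processed++
		commit(processed)
	}
	return processed
}

func main() {
	// One independent worker per partition, as described above.
	partitions := make([]chan string, 4)
	totals := make([]int, len(partitions))
	var wg sync.WaitGroup
	for i := range partitions {
		partitions[i] = make(chan string, 16)
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			totals[i] = consume(partitions[i], func(int) {})
		}(i)
	}

	// Producer partitions by an evenly distributed key (round-robin here).
	for e := 0; e < 100; e++ {
		partitions[e%len(partitions)] <- fmt.Sprintf("event-%d", e)
	}
	for _, p := range partitions {
		close(p)
	}
	wg.Wait()

	sum := 0
	for _, n := range totals {
		sum += n
	}
	fmt.Println("processed:", sum)
}
```

Because workers never coordinate, throughput scales with partition count, which is why an evenly distributed key matters: a skewed key leaves one worker with most of the load.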

I thought this would be about implementing custom notification server and not using google's or apple's.

Is that a lot of notifications ?

I have done something similar. I think I did much better than this. But I didn't pass their interview, haha.

1 million notifications per hour, or ~300 per second, seriously?
