We recently hit a peak of 850 Million notifications per second, and 5 billion notifications per day. Here's a blog post on how we do it. Written back when we were at "only" 2 billion notifications per week: https://onesignal.com/blog/rust-at-onesignal/
> We recently hit a peak of 850 Million notifications per second
From your blog "OnePush is fast - We've observed sustained deliveries up to 125,000/second and spikes up to 175,000/second."
I think you may have a typo. The bandwidth would be incredible too: even at an unlikely 10 bytes per delivery, that would be 8.5 GB/sec.
It's generally under 4 KB per delivery, depending on the content, and we have several delivery servers. APNS, for example, doesn't support payloads larger than 4 KB.
iOS, however, is still quite challenging at scale since there's no batching mechanism for APNS, so even bandwidth becomes a bottleneck. We do web push too, which requires a lot of CPU cycles to encrypt each payload for its recipient.
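To give a feel for that per-recipient cost, here's a minimal sketch of what has to happen for every single web push subscriber. This is not the full RFC 8291 derivation (the real scheme also mixes in the subscriber's auth secret, a salt, and specific info strings); the function name and HKDF info string are made up for illustration:

    package webpush

    import (
        "crypto/aes"
        "crypto/cipher"
        "crypto/ecdh"
        "crypto/rand"
        "crypto/sha256"
        "io"

        "golang.org/x/crypto/hkdf"
    )

    // encryptForSubscriber is a simplified stand-in for RFC 8291: a fresh
    // ephemeral ECDH key agreement, an HKDF key derivation, and an AES-GCM
    // seal, all repeated per recipient -- that's where the CPU goes.
    func encryptForSubscriber(subscriberPub *ecdh.PublicKey, plaintext []byte) ([]byte, error) {
        ephemeral, err := ecdh.P256().GenerateKey(rand.Reader)
        if err != nil {
            return nil, err
        }
        shared, err := ephemeral.ECDH(subscriberPub)
        if err != nil {
            return nil, err
        }
        key := make([]byte, 16) // the real scheme also mixes in an auth secret and salt
        if _, err := io.ReadFull(hkdf.New(sha256.New, shared, nil, []byte("sketch")), key); err != nil {
            return nil, err
        }
        block, err := aes.NewCipher(key)
        if err != nil {
            return nil, err
        }
        gcm, err := cipher.NewGCM(block)
        if err != nil {
            return nil, err
        }
        nonce := make([]byte, gcm.NonceSize())
        if _, err := rand.Read(nonce); err != nil {
            return nil, err
        }
        return gcm.Seal(nonce, nonce, plaintext, nil), nil
    }

None of those steps are expensive in isolation, but an elliptic-curve key generation and agreement per message, times millions of messages, adds up fast.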
The architecture seems rather overengineered considering a single Raspberry Pi could do the job, even after 100x scaling!
WSE back then had a limit of 10k messages per second (as in one message every 1/10,000th of a second). Messages arrived via two separate network operators, so that was 20k messages to be deduplicated down to 10k operations. Incoming messages arrived compressed, so they required decompressing.
Responding to a message required complex processing and may then have resulted in an order to the market, which, again, had to be constructed and validated.
All this worked on a single server (regular two-socket Xeon-based server).
There was a 10us (microseconds!) time budget to send a response to the market, and it had to work every time, even under the maximum possible load (10k messages per second).
Routing and forwarding 300 messages per second doesn't seem like something to brag about...
That said, it really doesn't sound too difficult with straightforward architecting (says the armchair critic).
To start, there's nothing here about what machine this architecture runs on, it could be running on a Raspberry Pi for all we know.
Then, this ignores the cost of database lookups for the keys. That data is probably small enough to fit on the one machine, but then you need support for reliably syncing that data to the service in real time. A separate database is therefore probably the right solution here, which means you're doing networking on each message send, which makes it unlikely that a Pi could do this.
Next up you've got the issues of reliability. The message queue separation gives you better reliability in the face of issues such as upstream APIs going down or erroring, or issues for a specific user. All the business logic around handling this, the message queue handling persistence and ACID semantics (or parts of it), this all takes additional resources, not to mention potentially a fair bit of disk space (for a Pi) to queue up undelivered messages should an upstream API slow down or stop accepting new messages.
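As a rough illustration (all names here are hypothetical, not from the article), the queue is what lets a worker retry a failed delivery with backoff instead of dropping it on the floor when an upstream API misbehaves:

    package delivery

    import "time"

    type notification struct {
        UserID  string
        Payload []byte
        Retries int
    }

    // worker drains the queue; on upstream failure the message is re-queued
    // with exponential backoff rather than lost. A real system would persist
    // the queue to disk so a crash doesn't drop undelivered messages.
    func worker(queue chan notification, deliver func(notification) error) {
        for n := range queue {
            if err := deliver(n); err != nil && n.Retries < 5 {
                n.Retries++
                go func(n notification) {
                    time.Sleep(time.Duration(1<<n.Retries) * time.Second)
                    queue <- n
                }(n)
            }
        }
    }

Note the in-memory channel is exactly the weak point: if the upstream stays down, those re-queued messages have to live somewhere, which is where the disk space concern above comes from.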
Then you have hardware failure, at this scale you don't want a single machine failure to wipe out your primary communication method with millions of customers. You'd therefore want to have a distributed system, even if that's only for reliability rather than performance.
Lastly, 21 million clock cycles might sound like a lot, and might go a long way with C/C++/Rust, but as you move up to more dynamic languages that will reduce significantly. It happens that they are using Go here, and that's likely to get pretty good performance out of the hardware, but writing this service in Python/Ruby would be a very valid choice for developer productivity, or based on existing skills they have in the team. That might be 1/10th the performance, but since you need a distributed system for reliability here anyway, adding a few more machines to the pool might be a better choice than introducing a lower level language that takes longer to develop and exposes you to memory safety or threading issues.
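(For scale: 21 million cycles per message is presumably something like a 4-core 1.5 GHz Pi, i.e. ~6x10^9 cycles/sec in aggregate, divided by ~280 messages/sec at a million an hour, which comes out around 21 million cycles per message.)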
There may well be other factors I haven't considered here, but I think for the use case of delivering that scale of messages, the reliability options you get with a system like this are well worth the additional hardware requirements and architecture overhead.
Edit: lmilcin makes a good point about trading systems, but there are several differences – that system still has a single point of failure, it was probably written in a low level language with input from experts on performance, and it was running on a much faster machine. The single point of failure of a server-grade machine like that is probably an acceptable risk if you own the hardware, but in a cloud environment (which brings other benefits) hardware is less reliable so probably not an acceptable risk there. I don't think it's an apples-to-apples comparison, although it is interesting.
> Then you have hardware failure, at this scale
It's really, really not scale!
> Lastly, 21 million clock cycles might sound like a lot, and might go a long way with C/C++/Rust, but as you move up to more dynamic languages that will reduce significantly
Then you're prob using the wrong lang.
I don't really accept that level of engineering is necessary all round, unless the business case requires it, and then I'd speak to whoever put those business requirements together and ask hard questions.
Talking to 1 million different people an hour is _business_ scale. Regardless of the tech required to do that, it not working would likely be a significant business impact.
> Then you're prob using the wrong lang.
There's so much more here than performance. There's developer productivity, there's tooling availability, all sorts. They happen to be using Go which is probably the best trade-offs for this particular system, but if you were doing complex machine learning you'd probably want to use Python due to all the excellent tooling available, and that means having a "slow" language for parts that aren't optimised for you.
If the difference is 1 engineer, ~1k lines of code, 3 machines, vs 3 engineers, ~10k lines of code and 1 machine, the former is likely to be the right trade off for most companies. It's cheaper to build, and since number of bugs typically correlates to lines of code, it will likely be much more reliable.
As for the ACID semantics, you're right you wouldn't need them all here, redelivery is probably fine within some bounds, and the upstream APIs might even have idempotency tokens to prevent this, but the downtime of losing a machine for a few hours, the time to regenerate all those messages that were lost on the downed machine, etc, could equate to quite a bit of downtime and poor UX for users.
You're not wrong from a performance perspective, but taking into account reliability, business impact, user experience, and developer productivity/costs, I think the solution in the blog post is a better set of trade-offs than you're suggesting.
Replace Raspberry Pi with any local server configuration.
One million per hour? I forgot how to count that low.
That's what doorbells are for.
They aren't though. There are very few use cases that aren't bursty, and a plethora of reliable queueing systems exist.
However, I now know more about the company. They have one of the leanest engineering teams among comparable companies here. Their CTO & VP Eng are incredibly practical guys whose advice would resonate with pretty much everyone here (e.g. an interview where they reiterate the dangers of scaling too fast). They have small multifunctional product teams, and the microservices architecture fits this (as opposed to being built for fanciness / résumés' sake). And they did it. There are dozens of "features" in the app which are entire giant businesses in their own right.
So I'd imagine the shortest path to getting this up and running was definitely considered; it would have been nice to go into more detail as to why simpler / outsourced solutions wouldn't work, but I'd guess there's a good reason.
Sometimes there are no good reasons, and we should have the ability to question them; that's how progress is made, in science and in engineering.
With that said, I was surprised to learn how complicated it can be to send notifications cross platform. Fortunately we're targeting the web, so I really only have to worry about two push provider implementations: APNs and VAPID. I really hope Apple agrees to implement VAPID sooner than later so developers can stop wasting their time.
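For anyone curious, once the hard parts (JWT signing, payload encryption) are done, the delivery side of VAPID web push is just an HTTP request. A minimal sketch; the function name and TTL value are placeholders, and the ES256 JWT signing and aes128gcm encryption are elided:

    package webpush

    import (
        "bytes"
        "net/http"
    )

    // sendWebPush shows the RFC 8030/8292 request shape. The caller must
    // supply the encrypted body (RFC 8291 aes128gcm) and a signed ES256 JWT.
    func sendWebPush(endpoint string, encryptedBody []byte, signedJWT, publicKeyB64 string) error {
        req, err := http.NewRequest("POST", endpoint, bytes.NewReader(encryptedBody))
        if err != nil {
            return err
        }
        req.Header.Set("TTL", "60")                     // seconds the push service may queue it
        req.Header.Set("Content-Encoding", "aes128gcm") // payload is encrypted per recipient
        req.Header.Set("Authorization", "vapid t="+signedJWT+", k="+publicKeyB64)
        resp, err := http.DefaultClient.Do(req)
        if err != nil {
            return err
        }
        return resp.Body.Close()
    }
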
We have used OneSignal in the past, but there's something so satisfying about delivering the notification yourself. Also, a word of caution for people new to push notifications: we discovered early on that sending a million push notifications all at once is a really good way to crash your site, as they have a surprisingly high click-through rate in the first minute after sending!
There's a whole spectrum of existing options: from the simple AWS SNS ($0.50/million messages) or Google Firebase (free) solutions that do multi-device push messaging while managing keys, redelivery, and delivery responses, to the more managed services like Urban Airship and OneSignal.
The actual Android delivery component has to end up going through Firebase anyway.
The only part of this solution SNS and Firebase don't provide is the user fan-out functionality.
Surely there are other, more important features delivering business value to be built, rather than re-inventing (and then scaling and supporting) the delivery of mobile notifications.
Firebase and SNS have since entered this market as well. I'm obviously biased, but I honestly can't recommend them. They work for some of the basics, but there's been very little innovation in either product and occasional service issues that go unsolved for a long time. They also don't make any money from them, which should be a concern for anyone that plans to use these services for such a core part of their business.
You might want to throttle messages based on user settings or, say, ML models, to avoid sending too many annoying messages. For critical stuff you might want to switch over to email or SMS.
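A per-user throttle is only a few lines if you reach for a token bucket. A sketch assuming golang.org/x/time/rate; the one-per-minute, burst-of-3 numbers are made up, standing in for a user setting or model score:

    package throttle

    import (
        "sync"
        "time"

        "golang.org/x/time/rate"
    )

    type throttler struct {
        mu       sync.Mutex
        limiters map[string]*rate.Limiter
    }

    func newThrottler() *throttler {
        return &throttler{limiters: make(map[string]*rate.Limiter)}
    }

    // allow reports whether we should send this user another notification now.
    func (t *throttler) allow(userID string) bool {
        t.mu.Lock()
        lim, ok := t.limiters[userID]
        if !ok {
            // One notification per minute with bursts of 3; in practice this
            // would come from the user's settings or an ML-derived score.
            lim = rate.NewLimiter(rate.Every(time.Minute), 3)
            t.limiters[userID] = lim
        }
        t.mu.Unlock()
        return lim.Allow()
    }
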
These are startups that have the funding to do this kind of thing. This is a core part of the stack for any such company.
You have to use Firebase to deliver messages to Android anyway, and having it deliver to APNS as well comes for free with that implementation.
Maybe you do want to do those things, but this implementation doesn't, nor do they mention the intention to do that. But if you did, SNS also provides implementations for delivering SMS and Email (preferably through SES).
Startups shouldn't waste their money on this type of thing, especially if you consider most startups will fail. Vendor lock-in is a pretty stupid thing to worry about for an early-stage start-up (and I argue it's stupid to worry about long-term too if that vendor is AWS, Google or Azure who are all very competitive with each other).
Nothing in the article really said how they got to millions an hour. In my experience the notifications are cheap enough to send that that figure is easy to obtain without any special effort on the push side. It's also embarrassingly parallel/easy to scale. How everything else scales around that (recipient filtering, reporting) is the interesting part.
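The "embarrassingly parallel" part really is just a worker pool over device tokens. A minimal sketch (names are illustrative, not from the article):

    package fanout

    import "sync"

    // fanOut delivers to each token independently across a fixed worker pool;
    // since no delivery depends on another, adding workers (or machines)
    // scales throughput close to linearly.
    func fanOut(tokens []string, workers int, send func(string)) {
        ch := make(chan string)
        var wg sync.WaitGroup
        for i := 0; i < workers; i++ {
            wg.Add(1)
            go func() {
                defer wg.Done()
                for tok := range ch {
                    send(tok)
                }
            }()
        }
        for _, tok := range tokens {
            ch <- tok
        }
        close(ch)
        wg.Wait()
    }

Producing the token list (recipient filtering) and tracking what happened to each send (reporting) are where the real work is, and the article doesn't say much about either.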
If you're really worried about lock-in, most event based PaaS support running containers. Moving between clouds would then just be the glue-work around your containers.
* SMS - You can't just interface directly with some cell company (which is proprietary anyways).
* EMAIL - Have you tried to send your own emails in the past decade? Just use a provider.
* Push Notifications - Have to use Firebase for Android anyways.
Your core stuff should limit proprietary stuff. Edges are going to change over time no matter what, might as well use stuff that makes you faster now.
Wut? People don't even know anymore that you can use cellphones to send SMS?! (And no, that's not the only option, SS7 works just fine as well.)
> * EMAIL - Have you tried to send your own emails in the past decade? Just use a provider.
Have you? Talking nonsense does not make expertise ...
If Uber had built this, no one would question why they didn't use some off-the-shelf solution. Gojek is a Google Cloud customer and under their startup program, Surge, so I am sure they are aware of the solution being proposed.
They don't call themselves a technology company, but a company which helps Gojek drivers find jobs and earn extra. In the process they became a food delivery, transportation, payments, and courier company. In spite of this, they still focus on their primary objective: finding jobs for Gojek drivers and letting them earn extra for a better living.
Before the Gojek app, most of the drivers were just idling waiting for passengers; now, with a $25 KaiOS feature phone, they can just receive a message and earn instead of idling. They don't have to use an iPhone or Android or some heavy OS. Indeed, based on a social survey, some of the drivers could increase their earnings 3-4 times, depending on how willing they are to take up additional jobs.
I believe the beauty of Gojek lies in its model: focusing on finding jobs and developing solutions to help those motorcycle drivers using technology, although they've extended it to cars. I am intrigued by their idea since it's not a copy of Uber, but a solution to a problem local to Indonesia, and one which can extend to any country in the world: finding jobs to do while idling and earning more.
> If uber built these no one will question them why don't they use some off the shelf solution.
If it's over-engineered, they probably would.
Disclaimer: I work at Gojek.
> If uber built these no one will question them why don't they use some off the shelf solution
Uber and Gojek don't have the same talent pool, so we can't compare them apples-to-apples.
It is a risky move for Gojek and not for Uber because of the talent pool in their respective areas.
If you have used their app and done a review of their platform comparing it with Uber, can you please share?
Indeed, Uber did try to enter Indonesia and failed; they were pushed out of most of Southeast and East Asia because they didn't have enough engineering talent to build systems for those specific countries. Local companies like Grab in Singapore, Gojek in Indonesia, and Didi in China beat them. So why would you think those companies don't have the talent to build systems better suited to their own environment than Uber?
Because they don't understand the local market, period. Nothing more than that. It has zero relationship with Uber not able to hire engineering talent. They simply don't know the market.
If you put the smartest engineers in the room but they don't know the business domain, they will fail from day one. I think people put engineers above everything else, when in the majority of situations business-domain knowledge trumps engineering excellence.
That is one reason. The other reason is that locals will choose local (see China, with its tons of Chinese-only services). The Indonesian President, Mr. Joko Widodo, will surely promote Gojek over Uber.
> So why would you think those companies do not have a talent to build systems better suited to their own environment
You have to understand the context here.
Why do you think Gojek opened a huge "lab" in India?
Why do you think Grab opened a "lab" in Seattle and hired Steve Yegge?
You may not have realized this, but they are not the only South East Asian "Unicorns"/"Decacorns" who have opened R&D labs outside their homebase.
I can't share much of the details why these companies (_and_ a few others) decided to do so but feel free to guess a bit here and there.
Uber couldn't iterate their platform and technology according to local market requirements and conditions; engineering was one of the reasons for the failure. They then tried with money, and that didn't work either. Uber failed due to a combination of business and technology platform issues.
If you prefer to live in a bubble, that's fine: there's no better engineering talent anywhere in this world except at Uber. But the ground truths are already there in the markets from which they retreated.
Gojek opened centers in India because it's cheaper than Singapore.
Grab's costs in Singapore and Malaysia are higher compared to India too.
So, like any other US firm that has opened a centre in India to save costs, they did too.
Grab isn't the only one from SE Asia that opened a lab in North America. There are at least two more, from Indonesia, that opted for North America instead of Singapore/Malaysia/India.
Can't share details other than "talent pools just not there".
Feel free to disagree with the execs.
Sure they will open in Seattle, as it's 30% or more cheaper than Singapore. Grab's largest R&D team is in Singapore; they did bring in talent from around the world here. But as the cost in Singapore is very high, hiring an expat is very expensive.
So it does make sense to open an R&D centre in a cheaper location, and it also shows they have engineering talent from around the globe.
Google fresh-grad total compensation should be between $150k and $300k USD in Mountain View. Fresh grad. Seattle usually costs 10-15% less in total compensation compared to Mountain View.
Heck, even a recent Microsoft fresh-grad offer: $110k base, 10-20% bonus, and RSUs. That's MSFT, not FAANG. FAANG will give you a sign-on bonus on top of that, ranging from $50k to $100k.
Keep in mind Grab RSUs are paper money at the moment.
I highly doubt Singapore engineers, given the same level, make that much. Probably only the top 0.5%.
I don't see a good discussion happening here, as you still think these folks come to North America and open R&D labs because it's cheaper than Singapore. That logic doesn't make sense out of the gate: India is way cheaper, so why not grow there and not even bother opening a lab in Seattle? You do know that it is damn hard to grow an R&D lab in the USA due to regulations and immigration policies, right?
To avoid repeating the facts on the ground, I would suggest you reach out to the executives at those companies and ask why they do so, as I have done before, to avoid confusion, assumptions, and guessing games.
This might sound condescending, but hear me out:
- if you keep doing the same thing as everybody else out there, gluing together APIs from cloud vendors, you won't stand out.
- if you're a company of _that_ size (Uber, Gojek), you've got to do something big, because the name/brand is associated with "Engineering" excellence.
- if you're a leader (or in management), would you tell your employees to "suck it up" and do the boring stuff like everybody else? Or would you empower them to do more? Hype them up? This is a strategy for keeping them happy too.
I don't begrudge that. That's how companies evolve: they start small and scrappy, they grow very fast, and then they're at the level where "doing simple stuff" won't get you a promotion, or won't give you something cool to show during the interview for your next job.
It's workplace dynamics.
The actual delivery component of their messaging system is unnecessary.
Of course, as others mentioned, it's a super app, so they do many other things, but I believe this is one of their main businesses.
Example for Go-send (sending a package):
"I have motorbike problem please can you cancel".
A minute after cancelling I get a message on WhatsApp:
"Hey fixed my motorbike, still need?". Same thing with food orders.
Example for the Go-food:
"Hey I'm at the restaurant, the phone is down. Can you cancel order? I will do manually".
Obviously the restaurant and the driver get the full customer payment this way; Gojek gets zero. This depends on the customer cancelling, as otherwise the driver's rating takes a hit.
The point of the article was more that when you have multiple teams, each owning multiple products, having a single well-defined abstraction over external dependencies, provided as a service that you control, is important for managing the risks posed by those external dependencies and for making overall maintenance easier.
Does anyone know of any descriptions of architectures scaling to the next order of magnitude, 1,000 to 10,000 TPS? And even higher, say 100,000 to 1,000,000 TPS?
That depends on two things:
a.) which component is going to start breaking first, which depends on the usage pattern, and
b.) which business/tech compromises are OK to make, which depends on a.)
Generally speaking, though, you'd benchmark the system to find the bottleneck component and, based on its nature, either throw more hardware at it (horizontal or vertical scaling) or optimize the software.
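As a toy example of that first step, a rough load generator in Go (the endpoint and target rate are placeholders) that tries to hold a fixed TPS while you watch which component saturates first:

    package main

    import (
        "net/http"
        "time"
    )

    func main() {
        target := 10000 // attempted requests per second
        tick := time.NewTicker(time.Second / time.Duration(target))
        defer tick.Stop()
        for range tick.C {
            go func() {
                // Fire-and-forget probe; a real harness would record latency
                // histograms and error rates per component to find the bottleneck.
                resp, err := http.Get("http://localhost:8080/send") // hypothetical endpoint
                if err == nil {
                    resp.Body.Close()
                }
            }()
        }
    }
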