> We also decided that if a provider is slow to deliver messages, measured in the same way as before, we would reduce their share of the load by 10 percentage points
When doing systems design, this is a critical piece to include in almost any load-balancing scheme. A lot of the time you'll start with 100% of traffic balanced between two boxes, but what happens when one box fails or slows down? You don't want to overwhelm anything, because that leads to cascading failures.
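A rough sketch of the scheme the quote describes: start with an even split and knock 10 percentage points off a provider that is measured to be slow. The provider names and the two-provider assumption are invented for illustration.

```python
import random

# Traffic shares in percentage points (hypothetical providers).
weights = {"provider_a": 50, "provider_b": 50}

def penalise(slow_provider, points=10):
    """Shift `points` of traffic share away from a slow provider."""
    other = next(p for p in weights if p != slow_provider)
    moved = min(points, weights[slow_provider])  # never go below zero
    weights[slow_provider] -= moved
    weights[other] += moved

def pick_provider():
    """Pick a provider at random, weighted by current share."""
    names = list(weights)
    return random.choices(names, weights=[weights[n] for n in names])[0]

penalise("provider_a")
assert weights == {"provider_a": 40, "provider_b": 60}
```

Repeated penalties eventually drain a persistently slow provider to zero, which is the cascading-failure protection the comment above is after.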
If you want them to mean something, you need to have a direct connection with the carrier of your user, and you have to know that the carrier or their network doesn't just fake positive delivery receipts at some point in the system to make numbers look good. With an aggregator between you and the carrier, you have no way of knowing when the delivery path changed and the intermediary faked a delivery report.
The absence of a positive delivery receipt doesn't really mean anything either. Negative delivery reports have some information content though.
My context is sending verification codes though; receiving a code back from a user is a much better measure of successful delivery to the user than a delivery receipt. If you're sending news or something where there's no measurable direct action taken as a result of the message, I guess a delivery receipt is better than nothing, maybe?
Realtime measurement and Thompson sampling (or another multi-armed bandit approach) with as many credible providers as you can manage is the best way forward. Don't send retries through the same provider either.
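For the curious, a minimal Thompson sampling sketch over two hypothetical providers: each provider gets a Beta posterior over its delivery rate, and per message we sample from each posterior and pick the winner. Everything here (names, counters) is invented for illustration.

```python
import random

# [successes, failures] per provider; Beta(1, 1) prior via the +1 below.
stats = {"provider_a": [1, 1], "provider_b": [1, 1]}

def pick_provider():
    """Sample a delivery rate from each posterior; route to the highest draw."""
    draws = {p: random.betavariate(s + 1, f + 1) for p, (s, f) in stats.items()}
    return max(draws, key=draws.get)

def record(provider, delivered):
    """Feed back a delivery outcome (e.g. the user entered their code)."""
    stats[provider][0 if delivered else 1] += 1

# A provider that keeps failing gets sampled less and less often.
for _ in range(200):
    record("provider_a", delivered=True)
    record("provider_b", delivered=False)
```

Because the posterior keeps a little uncertainty, the losing provider still gets the occasional probe, so recovery is detected automatically. Per the comment above, a retry of a failed message would be routed to a provider other than the one that just failed.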
I'm surprised you'd frequently get false positives. What regions were you sending to?
They were essentially unheard of with the UK networks and very rare with European networks. It was only Middle East and African networks that were troublesome, but they represented a tiny fraction of our traffic. Even then, false positives were rare compared to messages just not getting delivered.
Once again, if you are receiving texts, that must mean you opted into them. The user is expecting them, and that's different from a user receiving two texts having never signed up for them.
I do believe this technique uses some out-of-the-box thinking and mostly DOES solve the OP's problem.
They really nailed the aesthetic and the consistency is unmatched.
Edit: I'm impressed at this tooling.
(I helped to start the current incarnation of that, and it’s probably the thing in my career that I’m most proud of.)
I'm so pleased that we've brought some of it in house with GDS, and I'm amazed (in a good way) that it actually works, and they are even part of the OSS community.
Departments actually preferred spending ££££££ with HP / Capita / Capgemini etc., because the big outsourcing companies didn't ask as many 'awkward' questions (about accessibility, for example), and the departments got to 'own' the product more.
I've been on UC (edit: Universal Credit) a while, and when I was first on it, it would lose messages (on both my side and the jobcentre official's, which frustrated them because I talk to them. Edit: they also have their own usability frustrations with it, with little or no way to feed them back up), log you out even before timing out, lose appointments or have them be set but not show up for you, and more.
There's plenty more I could say but I'll just leave something which sums it up. At the top it says "BETA This is a new service - your feedback will help us to improve it". Guess what, it's disabled so you cannot give feedback. It always has been disabled.
Visually pretty good but in terms of usability, not so good. Pretty sure no proper testing was done on it, though it has been getting a lot better recently.
The first version was on Microsoft Dynamics (yes a CRM...) for whatever reason iirc.
From a purely technical perspective, you would just distribute requests inversely proportional to response time. Probably under low load, one provider would get all the requests, and only in an outage or overload scenario would the other provider take the rest.
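As a concrete sketch of "inversely proportional to response time": weight each provider by the reciprocal of its latest observed latency. Provider names and timings here are made up.

```python
import random

# Most recent observed response time per provider, in seconds (invented).
response_times = {"provider_a": 0.12, "provider_b": 0.48}

def pick_provider():
    """Route with probability proportional to 1 / response_time."""
    names = list(response_times)
    weights = [1.0 / response_times[n] for n in names]
    return random.choices(names, weights=weights)[0]
```

With these numbers the faster provider gets roughly 80% of traffic, and as its latency climbs under overload, traffic automatically shifts to the other one, which matches the behaviour described above.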
Where did that constraint come from? Did I miss it in the article? Their initial approach, after all, had all of their load going to a single provider.
Gotta keep the service provider happy to ensure they still go along with the program.
The problem is you pick an SMS aggregator. They all tell you they have global coverage and direct routes. They all tell you their routing algorithm is the best of all the aggregators. And they're all full of BS.
If they weren't full of BS, I wouldn't have had a nice job managing verification code sending for a big messaging company though, so I guess it worked out for me. :P If SMS worked in general, the messaging company probably wouldn't have existed.
“Developed and researched a novel algorithm to reliably send messages under intense load”
“Tried 5 off the shelf solutions, picked 1 that seemed okay, and moved on with my life”
That’s why we keep reinventing the wheel. Also it’s fun and we all think we’re the smartest.
Neither. Focus on achieved business outcomes instead.
"Scaled GOV.UK Notify to 15 messages per second, ensuring reliable and timely delivery of 2FA codes and flood emergency warnings."
"Developed and researched" tells me you did something interesting but I can't tell if it was resume padding or real work. If you can't articulate why, that's a red flag. A good hiring manager should be able to detect resume padding and avoid hiring people that waste company resources on pet projects.
What I also really like about Gov.uk is that their apps seem to be open source:
When you create new source code, you must make it open so that other developers (including those outside government) can:
- benefit from your work and build on it
- learn from your experiences
- find uses for your code which you hadn’t found
Here is another solution: smstools (https://packages.debian.org/buster/smstools) and a bunch of SMS modems plugged into a USB hub. An SMS modem can send an SMS in under 5 seconds. They say 200,000 a day max, so let's say you want to cope with sending 200,000 in 4 hours.
That's a little under 14 per second, so let's say 15 per second. You need 75 modems to do that, or about AUD$4,000 worth of modems. Sorry for the AUD$ - I'm Australian. You will also need SIMs - about AUD$1,500 worth. Don't worry about having 75 modems in one spot; the mobile phone network is designed to cope with a stadium of people all sending at the same time.
Perhaps you want to shard it for reliability - maybe 5 machines, so add another AUD$5,000 for NUCs or similar, all with a minimal Debian install + a web server or whatever delivery mechanism you are going to use to get the SMSes to the servers. That's AUD$10.5K total. Write some glue code - which is a week tops - and job done.
The one question I'd be asking is how that compares to using the cloud. Third-party providers charge around AUD$0.05 per SMS. They say a minimum of 100,000 SMSes per day - or AUD$150K / month. The cost for the non-cloud solution is AUD$10.5K for the first month, then $1,500 / month after that for the pre-paid SIMs.
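Sanity-checking the arithmetic in the comment above (all figures in AUD, taken straight from the comment; per-modem rate assumes the quoted 5 seconds per SMS):

```python
# Throughput: 200,000 messages in a 4-hour window.
messages = 200_000
window_s = 4 * 3600
rate = messages / window_s            # ~13.9 msg/s, rounded up to 15

# Modem count: each modem sends one SMS per 5 seconds.
per_modem_rate = 1 / 5
modems = 15 / per_modem_rate          # 75 modems

# Costs: DIY first month vs cloud per month.
diy_first_month = 4_000 + 1_500 + 5_000   # modems + SIMs + machines
cloud_per_month = 100_000 * 0.05 * 30     # 100k msgs/day at $0.05 each

assert round(rate, 1) == 13.9
assert modems == 75
assert diy_first_month == 10_500
assert cloud_per_month == 150_000
```

The numbers in the comment check out: the DIY build pays for itself in about two days of cloud minimums.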
Downsides: when it breaks (and it will), you will have to diagnose what's going on. That can be hard if the cause is a welding shop starting up next door. You are also going to have to deal with the telcos screwing up their SMS infrastructure, which seems to happen in Australia every 12 months or so. But you can fight that to some extent by geographically distributing your NUCs and using several different telcos servicing each NUC. That way it becomes more obvious what failed. Finally, instead of NUCs, use industrial-rated PCs to get your reliability up.
 https://www.ebay.co.uk/itm/283828240407 You need the version with an 'S' (for serial interface) suffix, although you can often just change the firmware.
 https://fit-iot.com/web/products/fitlet2/, industrial temperature rated. So, no stinking air conditioning required. :D
Maybe when I was young. But I have grey hair now. I'm sure some of it is grey because over the years I've made too many one-off purchases of specialised boxes promising to do everything I needed at the time. The expensive, nerve-wracking disasters I've had in IT were caused by boxes like that - RAID arrays that used some "high speed" proprietary format, IBM SCSI boxes that needed specialised IBM disks whose firmware bug happened to spray shit across the data flowing over the SCSI bus on occasion, hell, even specialised telco APNs that worked for a few years then didn't, and after 6 months they admitted they had fired the people who set it up.
When that Hypermedia box died (and they all do eventually), I phoned up the supplier - and a replacement was an order, payment, international freight, and customs away. That was weeks. So we were down for weeks.
The alternative is to make do with off-the-shelf retail components that are sold to ordinary punters every day. Yes, the components aren't as reliable. You also have to provide glue - but you write the glue yourself or use open source, so visibility into problems is excellent and the response time is amazing. And they are dirt cheap, so you can keep a couple of hot spares on the shelf (as you would if you had 75 modems). Failing that, getting new ones is just a case of going down to your local retail outlet and picking one off the shelf. And they actually have _fewer_ bugs, partially because there are millions of them out there, and partially because a retail brand would be overwhelmed if their channel filled up with failures.
I saw a misbehaving IBM SAN take out a video production house once, and the 100 people who worked there. (Turns out movie-length video editing pushes a SAN very hard.) I didn't make that purchasing decision, but I may well have back then. There but for the grace of god go I. I've come close enough as it is.
So no 128 x SIM boxes for me thanks.
For what it's worth, a one-off expensive box is worth it if the purchase price includes a man carrying the requisite spares to your doorstep within 8 business hours when it breaks. Big companies like Dell, HP and IBM do that for their big iron. (In an amazing coincidence, the boxes they are willing to cover with that sort of service for 5 years, at a moderate increment in the purchase price, almost never break.)
The other time I'm now willing to purchase specialised boxes is if they are mostly open source (so I have visibility, and very little bespoke, poorly road-tested complex code), and I'm purchasing hundreds of them, so I can realistically price keeping a bunch of them on the shelf into the initial purchase.
Apparently they could've set up a system like Japan's, where your phone gets emergency alerts handled directly by the mobile operating system (which I actually experienced on my UK iPhone when I visited on holiday a few years ago), but the government was too cheap to set it up.
It seems strange that it would be complicated or expensive to set up. I've used cheap, aimed-at-tourist SIMs in developing countries where cell broadcast was used to send news and adverts, which is very annoying.
I can't really remember the country, it was years / 20 countries ago. Possibly Vietnam.
Foreign SIM crosses the border? Here comes a flood of SMS spam!
If I remember correctly, it was due to a poorly designed drop-down menu and a missing confirmation dialog. "Test Send" and "Send Send" were right on top of each other in a 10-point font, lol.
"Millions of mobile users in the UK have yet to receive the government's text message alert about coronavirus.
The SMS - telling people to stay at home - began being sent early on Tuesday morning.
But Vodafone has confirmed it only expects to complete the process later this Wednesday...."
The OP blog post says
> we have 2 different [text message] providers
but the BBC article says
> Vodafone is the only one of the UK's big four networks that has not finished the task.
If the two articles are talking about the same thing then it’s worth noting that the other major providers seemingly had no problems, and Vodafone chose to not send the messages at night time to avoid waking people up.
It could be a limit, but I don't see anything that indicates there's a correlation.
I've used Notify on several projects over the last couple of years; it's a really nice service and has never caused us any problems whatsoever.
I think it was intended to be a description of how it's sensible to set things up at the scale they happen to have.
I'd like to clarify that the remark about load handling was aimed at your SMS gateway providers.
My team found this pretty annoying for monitoring a Django site so we’ve ended up moving to a statsd push-based metrics approach and are finding numbers generally easier to trust and reason about.
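For context, push-based statsd metrics boil down to firing tiny UDP datagrams in the statsd line protocol at a collector. A minimal sketch (host, port, and metric names are assumptions, not the team's actual setup):

```python
import socket

STATSD_ADDR = ("127.0.0.1", 8125)  # assumed statsd host/port
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

def fmt_counter(metric, value=1):
    """statsd counter line, e.g. 'django.requests:1|c'."""
    return f"{metric}:{value}|c"

def fmt_timer(metric, ms):
    """statsd timer line, e.g. 'django.view.latency:42|ms'."""
    return f"{metric}:{ms}|ms"

def push(line):
    sock.sendto(line.encode(), STATSD_ADDR)  # fire-and-forget UDP

push(fmt_counter("django.requests"))
push(fmt_timer("django.view.latency", 42))
```

The fire-and-forget UDP push is why the numbers are easy to reason about: the app reports every event as it happens instead of being sampled by a scraper, and a dropped datagram just loses one data point rather than blocking a request.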
As the postmortem mentioned, it wasn't related to any of this load balancing work or our providers, it was us running into trouble with a different part of our system. That was a busy week (both in terms of numbers as you can see on https://www.gov.uk/performance/govuk-notify/notifications-by... but also in terms of us fighting fires).
I would design it like this:
Put all requests into a distributed queue, for example persistent pubsub.
Have workers take work from that queue.
Each worker should send a new request to a provider if the rate of requests sent in the past minute is < 2x the rate of requests in the previous minute, and the number of in flight requests is < 10x the average of the past minute, and the rate of errors, including timeouts, is <1%. If both providers are eligible, send to whichever has had the fewest requests in 24h.
This prevents flooding/DoSing a badly configured provider (a well configured provider would have ingress ratelimiting, and you could do away with all the above logic).
Have alerting on the age of the oldest item in the queue, and a monitoring dashboard showing dispatch rate to each provider, with response error codes.
All the state is local to the worker, and doesn't need persisting. If a worker crashes, the item doesn't get acknowledged to pubsub, and will be retried. If you like, you can autoscale the number of workers based on their cpu utilization.
I'd expect the above to scale to 10k qps per worker, and 5Mqps for 1000 workers before needing a redesign.
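The eligibility rule above can be sketched as a pure function (thresholds are straight from the comment; the provider names and the shape of the stats dict are invented for illustration):

```python
def eligible(rate_1m, rate_prev_1m, in_flight, avg_in_flight_1m, error_rate):
    """A provider is eligible if its send rate hasn't doubled minute-over-minute,
    in-flight requests aren't 10x the recent average, and errors are under 1%."""
    return (rate_1m < 2 * rate_prev_1m
            and in_flight < 10 * avg_in_flight_1m
            and error_rate < 0.01)

def choose(providers):
    """providers: dict of name -> stats dict with the fields used below.
    Returns the eligible provider with the fewest requests in 24h, or None."""
    ok = [p for p, s in providers.items()
          if eligible(s["rate_1m"], s["rate_prev_1m"],
                      s["in_flight"], s["avg_in_flight_1m"], s["errors"])]
    return min(ok, key=lambda p: providers[p]["sent_24h"]) if ok else None

providers = {
    "a": {"rate_1m": 10, "rate_prev_1m": 9, "in_flight": 5,
          "avg_in_flight_1m": 4, "errors": 0.0, "sent_24h": 1000},
    "b": {"rate_1m": 30, "rate_prev_1m": 9, "in_flight": 5,
          "avg_in_flight_1m": 4, "errors": 0.0, "sent_24h": 500},
}
assert choose(providers) == "a"  # "b" fails the 2x rate-growth check
```

Returning None when nothing is eligible is what lets the item sit unacknowledged in the queue and be retried later, per the crash-recovery behaviour described above.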