
How Gov.uk Notify reliably sends text messages to users - robin_reala
https://gds.blog.gov.uk/2020/04/03/how-gov-uk-notify-reliably-sends-text-messages-to-users/
======
jtchang
One of the most important parts about this is that they applied backpressure
to the load balancing process:

> We also decided that if a provider is slow to deliver messages, measured in
> the same way as before, we would reduce their share of the load by 10
> percentage points

When doing systems design this is a critical piece to include in almost all
load balancing. A lot of the time you will start with 100% of traffic load
balanced between two boxes, but what happens when one box fails or slows
down? You don't want to overwhelm whatever is left, because that leads to
cascading failures.
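
A minimal sketch of the rule quoted above, assuming integer percentage shares that sum to 100. The function names, the `floor` parameter (a minimum share so a slow provider still receives some traffic and can be seen to recover) and the even redistribution are illustrative assumptions, not taken from the Notify code:

```python
import random


def penalise_slow_provider(shares, slow, step=10, floor=10):
    """Move up to `step` percentage points away from the slow provider.

    `shares` maps provider name -> integer percentage. `floor` is an
    assumption of this sketch: the slow provider never drops below it.
    """
    shares = dict(shares)  # do not mutate the caller's mapping
    others = [name for name in shares if name != slow]
    if not others:
        return shares
    moved = min(step, max(shares[slow] - floor, 0))
    shares[slow] -= moved
    # Spread the moved points evenly; any remainder goes to the first providers.
    per_other, remainder = divmod(moved, len(others))
    for i, name in enumerate(others):
        shares[name] += per_other + (1 if i < remainder else 0)
    return shares


def pick_provider(shares, rng=random):
    """Choose a provider for one message, weighted by its current share."""
    names = list(shares)
    return rng.choices(names, weights=[shares[n] for n in names], k=1)[0]
```

Calling `penalise_slow_provider({"a": 50, "b": 50}, "a")` once gives a 40/60 split; repeated calls stop at the floor, so the total always stays at 100.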

~~~
duxup
Load balancers rule the world.

------
flibble
We are facing the issue of SMS not getting delivered and have implemented
failovers (Messagebird, Nexmo, Twilio) when we detect delivery problems.
However we have a problem that frequently we get sent positive delivery
receipts despite the SMS not being delivered. This makes it hard to know if we
should fail over. Does anyone have a good solution for this?

~~~
xeromal
Who do you get false positives from the most?

~~~
dsincl12
For me personally it was Messagebird. Two failed demos because of this and the
response from them was non-existent. I was even contacted by one of their
sales people after I wrote my last "angry I'm leaving" email asking me if I
was interested in using their services.

~~~
theblueprint
Yes, similar position with Messagebird. Lack of transparency in their
reporting/insights vs Twilio. That said, our Twilio pricing is 2-3x
Messagebird.

------
sergiomattei
On a side note, I'm a huge fan of the standardized design language of all UK
Government sites.

They really nailed the aesthetic and the consistency is unmatched.

Edit: I'm impressed at this tooling.[1]

1: [https://www.gov.uk/service-toolkit#gov-uk-services](https://www.gov.uk/service-toolkit#gov-uk-services)

~~~
robin_reala
That’s the GOV.UK Design System in play:
[https://design-system.service.gov.uk/](https://design-system.service.gov.uk/)

(I helped to start the current incarnation of that, and it’s probably the
thing in my career that I’m most proud of.)

~~~
zouhair
So that's where Canada.ca aesthetic comes from.

~~~
robin_reala
Also [https://www.govt.nz/](https://www.govt.nz/) and
[https://www.gov.au/](https://www.gov.au/).

------
hn_throwaway_99
I'm glad that they improved their delivery, but one thing I find frustrating
about our industry is how often we all seem to be reinventing the wheel. I
mean, there are _tons_ of well-described load balancing algorithms with
various pros/cons. From the article it sounds like they just figured out their
load balancing algorithm through trial and error, rather than researching load
balancing algorithms first, then tweaking them based on real world
performance.

~~~
Swizec
Yeah but what looks better on your CV?

“Developed and researched a novel algorithm to reliably send messages under
intense load”

Or

“Tried 5 off the shelf solutions, picked 1 that seemed okay, and moved on with
my life”

That’s why we keep reinventing the wheel. Also it’s fun and we all think we’re
the smartest.

~~~
rootusrootus
As a hiring manager, I know which one I'd rather see on a resume. Though I
understand why an individual would choose the former.

~~~
cosmodisk
This is the stuff most of us have to deal with on a daily basis. Do I just
google it and move on, or should I try to come up with it myself? Every time I
google, I don't feel good at all. In fact, I don't feel anything. When I come
up with a solution myself, it lifts my motivation and I always learn
something new. As a manager though, I sometimes allow this kind of thing, while
on other occasions I specifically tell people not to spend time on creative
stuff and just get something off the shelf.

------
oakesm9
Here's the full source code if anyone wants to take a look:

[https://github.com/alphagov/notifications-api](https://github.com/alphagov/notifications-api)

------
geospeck
One interesting thing that I read the other day from a lead developer at
Gov.uk was about how they managed to build a service within a couple of
hours[1]

What I also really like about Gov.uk is that they seem to have their apps open
source[2]

[1][https://twitter.com/RichardTowers/status/1243904365506760709](https://twitter.com/RichardTowers/status/1243904365506760709)
[2][https://github.com/alphagov](https://github.com/alphagov)

~~~
robin_reala
[https://www.gov.uk/service-manual/technology/making-source-c...](https://www.gov.uk/service-manual/technology/making-source-code-open-and-reusable)

When you create new source code, you must make it open so that other
developers (including those outside government) can:

- benefit from your work and build on it

- learn from your experiences

- find uses for your code which you hadn’t found

------
rstuart4133
I see everyone is focusing on 3rd party providers.

Here is another solution: smstools
([https://packages.debian.org/buster/smstools](https://packages.debian.org/buster/smstools)),
and a bunch of SMS modems plugged into a USB hub. An SMS modem can send an SMS
in under 5 seconds. They say 200,000 a day max, so let's say you want to cope
with sending 200,000 in 4 hours.

That's a little under 14 per second, so let's say 15 per second. You need 75
modems[0] to do that, or about AUD$4,000 worth of modems. Sorry for the AUD$ -
I'm Australian. You will also need SIMs - or about AUD$1,500 worth. Don't
worry about having 75 modems in one spot, the mobile phone network is designed
to cope with a stadium of people all sending at the same time.

Perhaps you want to shard it for reliability - maybe 5 machines, so add
another AUD$5000 for a NUC or similar all with a minimal Debian install + a
web server or whatever for whatever delivery mechanism you are going to use to
get the SMS's to the servers. That's AUD$10.5K total. Write some glue code -
which is a week tops and job done.

The one question I'd be asking is how does that compare to using the cloud.
Third party providers charge around AUD$0.05 per SMS. They say a minimum of
100,000 SMS's per day - or AUD$150K / month. The cost for the non cloud
solution is AUD$10.5K for the first month, then $1,500 / month after that for
the pre-paid SIMS.
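
The arithmetic in this comment can be checked with a short sketch. All figures are the ones quoted above; the function names are made up for illustration, prices are kept in whole cents to avoid floating point rounding, and a 30-day month is assumed:

```python
import math

SECONDS_PER_HOUR = 3600


def required_rate(messages, window_hours):
    """Messages per second needed to send `messages` within `window_hours`."""
    return messages / (window_hours * SECONDS_PER_HOUR)


def modems_for_rate(rate_per_second, seconds_per_sms):
    """Each modem is busy for `seconds_per_sms` per message."""
    return math.ceil(rate_per_second * seconds_per_sms)


def cloud_monthly_cost(sms_per_day, price_cents, days=30):
    """Third-party provider cost in whole AUD for one month."""
    return sms_per_day * price_cents * days // 100


def self_hosted_cumulative(months, hardware=4_000 + 5_000, sims_per_month=1_500):
    """Modems plus NUCs up front, then pre-paid SIMs every month (AUD)."""
    return hardware + sims_per_month * months


rate = required_rate(200_000, 4)       # a little under 14 per second
modems = modems_for_rate(15, 5)        # rounded up to 15/s for headroom
cloud = cloud_monthly_cost(100_000, 5)
first_month = self_hosted_cumulative(1)
```

With the rate rounded up to 15 per second this gives the 75 modems, AUD$150K per month and AUD$10.5K first-month figures quoted above; using the unrounded rate, the modem count would come out at 70.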

Downsides: when it breaks (and it will), you will have to diagnose what's
going on. That can be hard if the cause is a welding shop starting up next
door. You are also going to have to deal with the telcos screwing up their SMS
infrastructure, which seems to happen in Australia every 12 months or so. But
you can fight that to some extent by geographically distributing your NUCs
and using several different telcos servicing each NUC. That way it becomes
more obvious what failed. Finally, instead of NUCs use an industrial rated PC
[1] to get your reliability up.

[0]
[https://www.ebay.co.uk/itm/283828240407](https://www.ebay.co.uk/itm/283828240407)
You need the version with an 'S' (for serial interface) suffix, although you
can often just change the firmware.

[1] [https://fit-iot.com/web/products/fitlet2/](https://fit-iot.com/web/products/fitlet2/),
industrial temperature rated. So, no stinking air conditioning required. :D

~~~
ryanlol
Why not just buy a sim box instead?

~~~
rstuart4133
I've used Hypermedia’s SIM boxes. They are pretty good - neat install, didn't
break for a long while, well documented. Also very expensive for what they
did, but hey you are paying for a boxed solution, right?

Maybe when I was young. But I have grey hair now. I'm sure some of it is grey
because over the years I've made too many one-off purchases of specialised
boxes promising to do everything I needed at the time. The expensive, nerve
racking disasters I've had in IT were caused by boxes like that - raid arrays
that used some "high speed" proprietary format, IBM SCSI boxes that needed
specialised IBM disks whose firmware bug happened to spray shit across the
data flowing across the SCSI bus on occasion, hell even specialised Telco
APNs that worked for a few years then didn't, and after 6 months they admitted
they had fired the people who set it up.

When that Hypermedia box died (and they all do eventually), I phoned up the
supplier - and a replacement was an order, payment, international freight, and
customs away. That was weeks. So we were down for weeks.

The alternative is to make do with off the shelf retail components that are
sold to ordinary punters every day. Yes, the components aren't as reliable.
You also have to provide glue - but you write the glue or use open source, so
visibility into problems is excellent and the response time is amazing. And
they are dirt cheap, so you can keep a couple of hot spares on the shelf (as
you would if you had 75 modems). Failing that, getting new ones is just a case
of going down to your local retail outlet and picking one off the shelf. And
they actually have _fewer_ bugs, partially because there are millions of them
out there, and partially because a retail brand will be overwhelmed if their
channel fills up with failures.

I saw a misbehaving IBM SAN take out a video production house once, and the 100
people who worked there. (Turns out movie length video editing pushes a SAN
very hard.) I didn't make that purchasing decision, but I may well have back
then. There but for the grace of god go I. I've come close enough as it is.

So no 128 x SIM boxes for me thanks.

For what it's worth, a one off expensive box is worth it if the purchase price
includes a man carrying the requisite spares on your doorstep within 8 business
hours when it breaks. Big companies like Dell, HP and IBM do that for their
big iron. (In an amazing coincidence, the boxes they are willing to cover with
that sort of service for 5 years at a moderate increment in the purchase price
almost never break.)

The other time I'm now willing to purchase specialised boxes is if they are
mostly open source (so I have visibility, and very little bespoke poorly road
tested complex code), and I'm purchasing 100's of them so I can realistically
price in keeping a bunch of them on the shelf in the initial purchase.

------
867-5309
I thought this might have explained the mechanics behind the recent GOV_UK
CORONAVIRUS ALERT, but alas. It didn't even namedrop the SMS providers

~~~
benbristow
From what I've heard (could be wrong) the government literally just said to
all the networks 'send this to everyone' and let the individual networks
handle it.

Apparently they could've set up a system like in Japan where your phone gets
emergency alerts (which I actually experienced when I visited on vacation a
few years ago on my iPhone from the UK) and are handled specifically by the
mobile operating system but the government were too cheap to set it up.

[https://www.theguardian.com/world/2020/mar/23/government-ign...](https://www.theguardian.com/world/2020/mar/23/government-ignored-advice-set-up-uk-emergency-alert-system)

~~~
Symbiote
I think this is just GSM Cell Broadcast [1].

It seems strange that it would be complicated or expensive to set up. I've
used cheap, aimed-at-tourist SIMs in developing countries where cell broadcast
was used to send news and adverts, which is very annoying.

[1]
[https://en.wikipedia.org/wiki/Cell_Broadcast](https://en.wikipedia.org/wiki/Cell_Broadcast)

~~~
benbristow
Which countries were they? That does sound annoying. Does it happen with
standard network SIM cards?

~~~
Symbiote
I don't know if it happens to everyone, or if I chose the worst network
provider from the booths at the airport.

I can't really remember the country, it was years / 20 countries ago. Possibly
Vietnam.

------
the_arun
What if aws was govt? -
[https://www.cloud.service.gov.uk/](https://www.cloud.service.gov.uk/)

------
toomuchtodo
Take note USDS/18F! Something to consider instead of Govdelivery.

------
SparkyMcUnicorn
What's interesting is that this looks like a service anyone can use. API and
everything.

[https://www.notifications.service.gov.uk/](https://www.notifications.service.gov.uk/)

~~~
daguar
It’s open source, and Australia and Canada have both deployed instances. Would
love to see a US (or state run) instance.

~~~
anticensor
United States Alert Message Service as a branch of USPS?

~~~
toomuchtodo
You’d want it under the authority of GSA (like Login.gov), as a foundational
service (messaging.gov).

------
jimmySixDOF
This topic reminds me of the story in Hawaii when they sent out a broadcast
false alarm SMS alert to everyone about North Korean missiles

If I remember correctly it was due to a poorly designed drop down menu &
missing confirmation challenge box. "Test Send" and "Send Send" were right on
top of each other in a 10point font lol.

[1]
[https://en.wikipedia.org/wiki/2018_Hawaii_false_missile_aler...](https://en.wikipedia.org/wiki/2018_Hawaii_false_missile_alert)

------
Angostura
Before we get too carried away, it's worth seeing how the system coped with
unprecedented pressure:

[https://www.bbc.co.uk/news/technology-52037573](https://www.bbc.co.uk/news/technology-52037573)

"Millions of mobile users in the UK have yet to receive the government's text
message alert about coronavirus. The SMS - telling people to stay at home -
began being sent early on Tuesday morning. But Vodafone has confirmed it only
expects to complete the process later this Wednesday...."

~~~
ddddddj
Just in case it wasn't clear from discussion elsewhere on this post, Notify
didn't send out the text message mentioned in the news. That message was sent
directly by the networks at request of the government.

------
ohlookabird
It is a really useful service! I used their email service to get updates on
potential travel destinations in the past months when Covid-19 started to be a
thing.

------
blntechie
Not to diminish anything but 100 to 200k messages per day is not really huge
in relative scale of things. Several private services and government orgs in
China and India easily send 10x or more of that number.

~~~
SparkyMcUnicorn
They never said 100-200k was the limit, but rather that's how many they're
usually sending out per day.

It could be a limit, but I don't see anything that indicates there's a
correlation.

~~~
petepete
Definitely not a limit, on March 28th they sent 637k SMS messages.

[https://www.gov.uk/performance/govuk-notify](https://www.gov.uk/performance/govuk-notify)

I've used Notify on several projects over the last couple of years, it's a
really nice service and has never caused us any problem whatsoever.

------
CaciaraAsAServi
Is there a particular reason why, judging from the graphs, the rate suddenly
drops every 6 minutes or so? Batching or something like it?

~~~
ddddddj
Hey, I'm the person who wrote the blog. I was slightly interested in the
pattern too but didn't take the time to look into it. I assumed it was either
something to do with a service sending us traffic in that pattern or maybe
just something to do with Grafana. It could also be some unexplained behaviour
in our system, maybe something to do with how we are pulling items off the
queue. If I find some time next week I might take a proper look though!

~~~
danpalmer
If you’re using Prometheus and restarting servers regularly (a default
behaviour for gunicorn for example) then you’ll lose a little data, up to your
Prometheus scrape period, every restart.

My team found this pretty annoying for monitoring a Django site so we’ve ended
up moving to a statsd push-based metrics approach and are finding numbers
generally easier to trust and reason about.
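
A toy simulation of the effect described here, under simplifying assumptions: a single worker holds an in-memory counter, a scraper samples it on a schedule, and the total is rebuilt from the samples by treating any drop as a reset. This is a simplified model, not Prometheus's actual implementation, and the function name and timings are made up for illustration:

```python
def reconstructed_increase(event_times, scrape_times, restart_times):
    """Total number of events a pull-based scraper can recover from samples."""
    # Order everything on one timeline; on a tie, events happen before
    # scrapes and scrapes happen before restarts.
    timeline = sorted(
        [(t, 0, "event") for t in event_times]
        + [(t, 1, "scrape") for t in scrape_times]
        + [(t, 2, "restart") for t in restart_times]
    )
    counter = 0
    samples = []
    for _, _, kind in timeline:
        if kind == "event":
            counter += 1
        elif kind == "scrape":
            samples.append(counter)
        else:
            counter = 0  # a restart wipes the in-memory counter
    total = 0
    for prev, cur in zip(samples, samples[1:]):
        # A sample lower than the previous one means the counter was reset,
        # so only the new value can be counted.
        total += cur - prev if cur >= prev else cur
    return total


events = range(1, 61)            # one request per second for a minute
scrapes = [0, 15, 30, 45, 60]    # 15 second scrape period
seen = reconstructed_increase(events, scrapes, restart_times=[40.5])
```

In this run the worker restarts after the request at second 40, so the ten requests handled between the scrape at second 30 and the restart never show up in any sample: `seen` is 50 rather than 60.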

------
harel
As a user of Notify, I can vouch for the reliability of the service. It's rock
solid.

~~~
robga
It is interesting the article didn’t mention the incident 2 weeks ago when it
had problems with a 7-fold increase in volume. Though it did well under the
strain.
[https://status.notifications.service.gov.uk/incidents/jpwxyt...](https://status.notifications.service.gov.uk/incidents/jpwxytmst2tx)

~~~
ddddddj
Hey, I'm the one who wrote the blog post. This was originally a talk a
colleague on my team gave internally about 2 months ago and then I wrote it up
as a blog post a few weeks back, before we had that big incident so there
wasn't any particular thought on not mentioning it.

As the postmortem mentioned, it wasn't related to any of this load balancing
work or our providers, it was us running into trouble with a different part of
our system. That was a busy week (both in terms of numbers as you can see on
[https://www.gov.uk/performance/govuk-notify/notifications-by...](https://www.gov.uk/performance/govuk-notify/notifications-by-type)
but also in terms of us fighting fires).

------
londons_explore
I'm not sure I'd advocate for this design... The traffic sharing and backoff
seems crude... The "10% per minute" doesn't prevent sudden increases in load
killing both providers.

I would design it like this:

Put all requests into a distributed queue, for example persistent pubsub.

Have workers take work from that queue.

Each worker should send a new request to a provider if the rate of requests
sent in the past minute is < 2x the rate of requests in the previous minute,
and the number of _in flight_ requests is < 10x the average of the past
minute, and the rate of errors, including timeouts, is <1%. If both providers
are eligible, send to whichever has had the fewest requests in 24h.

This prevents flooding/DoSing a badly configured provider (a well configured
provider would have ingress ratelimiting, and you could do away with all the
above logic).

Have alerting on the age of the oldest item in the queue, and a monitoring
dashboard showing dispatch rate to each provider, with response error codes.

All the state is local to the worker, and doesn't need persisting. If a worker
crashes, the item doesn't get acknowledged to pubsub, and will be retried. If
you like, you can autoscale the number of workers based on their cpu
utilization.

I'd expect the above to scale to 10k qps per worker, and 5Mqps for 1000
workers before needing a redesign.
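
A rough sketch of the per-worker send rule proposed in this comment. The thresholds (2x, 10x, 1%) are the ones stated above; the `ProviderStats` fields, the function names and the `MIN_BASELINE` floor are assumptions added here. Without some floor, a provider that sent nothing in the previous minute could never start ramping up:

```python
from dataclasses import dataclass

# Assumption: minimum baseline so a cold or idle provider can still be used.
MIN_BASELINE = 10


@dataclass
class ProviderStats:
    name: str
    sent_last_minute: int
    sent_previous_minute: int
    in_flight: int
    avg_in_flight_last_minute: float
    errors_last_minute: int  # includes timeouts
    sent_last_24h: int


def is_eligible(p):
    """Apply the three conditions from the comment to one provider."""
    rate_ok = p.sent_last_minute < 2 * max(p.sent_previous_minute, MIN_BASELINE)
    in_flight_ok = p.in_flight < 10 * max(p.avg_in_flight_last_minute, 1.0)
    if p.sent_last_minute:
        error_rate = p.errors_last_minute / p.sent_last_minute
    else:
        error_rate = 0.0
    return rate_ok and in_flight_ok and error_rate < 0.01


def choose_provider(providers):
    """Pick the eligible provider with the fewest requests in 24h, if any."""
    eligible = [p for p in providers if is_eligible(p)]
    if not eligible:
        return None  # leave the item on the queue and retry later
    return min(eligible, key=lambda p: p.sent_last_24h)
```

Returning `None` when no provider qualifies leaves the message unacknowledged on the queue, which is what makes the "age of the oldest item" alert described above meaningful.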

