
Ways to reduce the costs of an HTTP(S) API on AWS - praveenscience
https://gameanalytics.com/blog/reduce-costs-https-api-aws.html
======
georgyo
I run a small service, ifconfig.io, that is now getting 200 million hits a day
from around the world.

The response from it is about as small as you could make it; even so, at that
volume it comes to about 150 GB a day.

If I hosted this on AWS, the bandwidth alone without any compute would cost
$900 a month. Prohibitively expensive for a service I just made for fun.

Just sending the HTTP response headers accounts for the majority of that cost,
too. There is no way to shrink them.

It is currently hosted on a single $40 linode instance and can easily keep up
with the ~2400 sustained QPS. I think it can get up to about 50% more traffic
before I have to scale it. And linode includes enough bandwidth with that
compute to support the service without extra costs.
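
A quick back-of-the-envelope check of those figures in Python (numbers taken
from the comment above, assuming decimal gigabytes):

```python
# Sanity-check the traffic figures quoted above.
hits_per_day = 200_000_000
bytes_per_day = 150 * 10**9              # ~150 GB egress per day

per_response = bytes_per_day / hits_per_day
print(round(per_response))               # 750 bytes per response, headers included

qps = hits_per_day / 86_400
print(round(qps))                        # 2315 sustained QPS, in the ballpark of the ~2400 quoted
```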

I don't see how anyone pays the bandwidth ransom that GCP and AWS charge.

~~~
blantonl
Well, to be fair, most of us who are paying the "bandwidth ransom" _do have to
scale_, quite significantly I might add, and so the value is the platform as
a whole.

Furthermore, if you are doing something for fun like you are, the bandwidth
ransom definitely comes into play for elastic cloud environments, but anyone
doing anything significant on AWS/GCP has definitely already negotiated down
their bandwidth spend with their AWS/GCP account management team.

~~~
georgyo
At large scales, decisions need to get made. AWS and GCP will not negotiate
with you unless you're big enough to make any of that worth their time.

Netflix is a great example. They run most of their services on AWS. But they
also run their own CDN with real hardware in data centers, because serving
that traffic from Amazon would be a deal breaker.

There are reasons to use AWS and GCP. But when I start a project, I don't
start there. It's too expensive one way or another, and the "free" tier gets
blown out extremely quickly.

A smaller provider will give you what you need, will normally be cheaper, and
has no lock-in. If you later decide that you really want autoscaling or managed
databases then you can move easily. And if you do switch, you'll at least know
what your product even wants to be, and its projected growth.

------
juliansimioni
This is a good list of ways to reduce outgoing bandwidth costs, but as someone
who has switched from backend developer to running a small business, I can't
help but notice that they don't talk at all about whether any of their cost
savings were meaningful to the business.

Sure, it looks like they saved about $2000/month, but consider that those
savings probably won't even pay for a quarter of one of their developers.

Even though their service is free (their parent company gets business value
from the aggregate analytics they obtain through their service), it's very
possible that there's something they could have done to bring more value to
their parent company than the money they saved here.

Maybe it's unreasonable to expect a company to talk about that in a blog post,
but it left me wondering.

~~~
markonen
My read was that they actually saved over $8000 per month:

- They mention that the initial savings of $1500/mo from omitting unnecessary
headers was 12% of their egress cost (so the total before this was $12,500)

- Then they got an additional 8% of savings by increasing the ALB idle
connection timeout to 10 minutes (8% of the remaining $11,000, so down to
$10,120)

- Finally they said they saved $200 per day by switching to a lighter TLS
certificate chain ($6000/mo, so down to $4,120)

None of those steps seem to have required any meaningful amount of development
work. Let's say this took a developer one week? The return on that effort
would be $100k a year, or $2500/hour for the first year alone.

~~~
blazespin
Considering they have enumerated this for others to pick up and execute
quickly, they may have just saved the wider industry potentially 100s of
thousands per month.

Give and take is an open source attitude. It doesn't always have to be about
source code; sometimes it can be about cost-saving techniques such as this.

------
alex_young
All great ideas.

Another suggestion:

Terminate somewhere else.

If you fit inside of the CloudFlare T&Cs, you can probably save a much larger
amount terminating there and having them peer with you using the same TLS
every time, or failing that, try someone like BunnyCDN.

I've found that while AWS CloudFront is easy to instrument, it's neither very
performant (lots of cache misses even when well configured) nor cost effective
(very high per-byte cost).

~~~
StavrosK
> terminating there and having them peer with you using the same TLS every
> time

Can you elaborate for someone who isn't that familiar with networking? How
does this work?

~~~
Taik
This is basically saying, use a 3rd party CDN (e.g. Cloudflare) to handle and
terminate client connections, letting the CDN pipeline the actual requests
through a handful of persistent connections to your server.

~~~
StavrosK
Ah, I see, thank you. So this is just to avoid TLS negotiation every time.

------
iconara
This was a great read.

We went through something similar a couple of years ago, when TLS wasn't as
pervasive as it is today, and at first focused mostly on minimising the
response size – we were already using 204 No Content, but just like the OP we
had headers we didn't need to send. In the end we deployed a custom-compiled
nginx that responded with "204 B" instead of "204 No Content" to shave off a
few more bytes. It turned out none of the clients we tested with cared about
the string part of the status, just that there was a string part.
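
The saving from shortening the reason phrase is easy to quantify; a quick
Python sketch (the ~5 billion requests/day figure is the article's volume,
used here for scale):

```python
# Byte difference between the standard and the shortened 204 status line.
full = b"HTTP/1.1 204 No Content\r\n"
short = b"HTTP/1.1 204 B\r\n"
saved = len(full) - len(short)
print(saved)                             # 9 bytes per response

# At ~5 billion requests/day that's ~45 GB/day of status-line bytes alone.
print(saved * 5_000_000_000 / 10**9)     # 45.0 GB/day
```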

When TLS started to become more common we realised the same thing as the OP,
that the certificates we had were unnecessarily large and cost us a lot, so
we switched to another vendor. When ACM came we were initially excited about
the convenience it offered, but after a quick look we decided it would be too
expensive to use for that part of our product.

------
chrismeller
I was honestly expecting some kind of meh article that said to reduce headers,
enable compression and other basic stuff. I was pleasantly surprised that
wasn’t the case... and absolutely astounded that the handshake provided that
much of a difference, it was the last thing I would have thought of.

------
maxkuzmins
At such a high volume of requests it probably makes sense to consider going
one abstraction level lower, replacing HTTPS with communication over plain
SSL sockets for further cost reduction.

Nice deep dive into the S of HTTPS anyway.

~~~
bureaucrat
Or just encrypt in-house and use HTTP.

~~~
maxkuzmins
These guys receive requests from mobile devices. Afaik sending unencrypted
HTTP requests is not allowed on some platforms (e.g. iOS).

~~~
throw03172019
You can disable transport security for a certain domain using the plist key
NSAppTransportSecurity. But you rarely should do this :)

[https://developer.apple.com/library/ios/documentation/Genera...](https://developer.apple.com/library/ios/documentation/General/Reference/InfoPlistKeyReference/Articles/CocoaKeys.html#//apple_ref/doc/uid/TP40009251-SW33)
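
For reference, such an exception would look something like this in an app's
Info.plist (example.com is a placeholder domain):

```xml
<key>NSAppTransportSecurity</key>
<dict>
    <key>NSExceptionDomains</key>
    <dict>
        <key>example.com</key>
        <dict>
            <key>NSExceptionAllowsInsecureHTTPLoads</key>
            <true/>
        </dict>
    </dict>
</dict>
```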

------
tumetab1
> Also, the certificate contains lengthy URLs for CRL download locations and
> OCSP responders, 164 bytes in total.

If you're going down that path, it's probably best to avoid revocation
altogether, since it doesn't really work, and go the Let's Encrypt way:
certificates with lower lifespans.

On that scale a 15 days cert on rotation is probably fine.

~~~
mhenoch
That's a good point. Seems like Let's Encrypt certificates contain an OCSP URL
but no CRL URL, so they are a bit smaller.

------
SlowRobotAhead
> We’re currently using an RSA certificate with a 2048-bit public key. We
> could try switching to an ECC certificate with a 256-bit key instead

Having just ruled out RSA on an embedded project for exactly this reason,
definitely the first thing that came to mind.

If they're getting down to byte-level differences, then among their additional
options they really should have considered binary-serialized data instead of
JSON. Something like CBOR can be converted to and from JSON almost directly,
but it would mean an update to all of their endpoints, which might not be
feasible; it could be worked in for new projects over time.

~~~
namibj
I'm sad about the state of support for ed25519/curve25519 crypto in TLS.

If you could reasonably deploy a website that doesn't offer anything else for
HTTPS, you'd instantly fix many session-establishment-based CPU DoS attacks.
It's multiple times faster than what you usually allow your server to
negotiate.

------
bandris
Perhaps AWS Certificate Manager certificates are deliberately large so more
outgoing traffic can be charged?

Interesting idea from the post: "it could be a selling point for a Certificate
Authority to use URLs that are as short as possible"

~~~
jrockway
I doubt it. AWS's certs are just another three-quarters baked AWS feature.
They did the best they could with the resources they had.

At my last job we had a fun and exciting outage when AWS simply didn't auto-
renew our certificate. We were given no warning that anything was broken, and
it apparently began the internal renewal process at the exact instant the cert
expired (rather than 30 days in advance as is common with ACME-based renewal).
Ultimately the root cause was that some DNS record in Route 53 went missing,
and that silently prevents certificate renewal.

We switched TLS termination from the load balancer to Envoy + cert-manager and
the results were much better. You also get HTTP/2 out of the deal. We also
wrote a thing that fetches every https host and makes sure the certificate
works, and fed the expiration times into Prometheus to actually be alerted
when rotation is broken. Both are features Amazon should support out of the
box for the $20/month + $$/gigabyte you pay them for a TLS-terminating load
balancer. Instead, Amazon's position is "you'll pay us anyway", and they're
right.
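
The expiry check part is only a few lines of stdlib Python; a minimal sketch
(the host list and the Prometheus/alerting wiring are left out):

```python
import socket
import ssl
from datetime import datetime, timezone

def parse_not_after(not_after: str) -> datetime:
    # Format produced by ssl.getpeercert(), e.g. "Jun  1 12:00:00 2025 GMT".
    dt = datetime.strptime(not_after, "%b %d %H:%M:%S %Y %Z")
    return dt.replace(tzinfo=timezone.utc)

def cert_expiry(host: str, port: int = 443) -> datetime:
    # Open a TLS connection and read the peer certificate's notAfter field;
    # the result can be exported as a gauge for Prometheus to alert on.
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=5) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            return parse_not_after(tls.getpeercert()["notAfter"])
```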

~~~
ti_ranger
> it apparently began the internal renewal process at the exact instant the
> cert expired (rather than 30 days in advance as is common with ACME-based
> renewal).

Was this some time ago?

The FAQ for ACM ([https://aws.amazon.com/certificate-manager/faqs/](https://aws.amazon.com/certificate-manager/faqs/)) says:

> Q: When does ACM renew certificates?
>
> ACM begins the renewal process up to 60 days prior to the certificate’s
> expiration date. The validity period for ACM certificates is currently 13
> months. Refer to the ACM User Guide for more information about managed
> renewal.

> We switched TLS termination from the load balancer to Envoy + cert-manager
> and the results were much better. You also get HTTP/2 out of the deal. We
> also wrote a thing that fetches every https host and makes sure the
> certificate works, and fed the expiration times in prometheus to actually be
> alerted when rotation is broken. Both are features Amazon should support out
> of the box for the $20/month + $$/gigabyte you pay them for a TLS-
> terminating load balancer.

You're implying that AWS doesn't support HTTP/2 on any load-balancers they
offer, but ALB has supported HTTP/2 since launch
([https://aws.amazon.com/blogs/aws/new-aws-application-load-balancer/](https://aws.amazon.com/blogs/aws/new-aws-application-load-balancer/))
3 years ago.

I don't see any current load-balancer priced at $20/month (ALB, NLB and
Classic ELB are all ~ $8/month), so I can't guess which one you were using
here ...

~~~
jrockway
I have no memory of when this was but it was on the order of 9 months to a
year ago.

"up to 60 days before" includes "five minutes after". What it excludes is the
renewal starting 61 days before the cert expires, and, as documented, it sure
didn't do that.

Stuff went wrong and we had no observability. That is the AWS way.

~~~
sciurus
Not to be mean, but you definitely had observability into the expiration date
of your certificate. You just weren't monitoring it yet. What you are doing
now with Prometheus sounds good.

~~~
toast0
If you need to figure out for yourself what to monitor about the service,
including things AWS says it handles, it brings into question the value of the
service.

------
rlastres
Funnily enough, Amazon.com uses a Digicert certificate similar to the one
mentioned in the article; they don't seem to use the ones they provide for
free on AWS :)

~~~
yandie
Big surprise. Contrary to popular belief, AWS wasn't/isn't built to
support Amazon.com. Some fundamental pieces are designed for Amazon.com scale,
but most other services are not (ACM in this case).

~~~
Dunedan
Amazon.com uses a lot of AWS services. They even write about it:
[https://aws.amazon.com/de/blogs/aws/amazon-prime-day-2019-powered-by-aws/](https://aws.amazon.com/de/blogs/aws/amazon-prime-day-2019-powered-by-aws/)

Of course it's true that they don't use all AWS services, either because they
don't need them or because they had something built in house earlier which
works for them.

------
chrissnell
Didn't see it mentioned: SSL tickets. If you're running an NLB and nginx in a
pool of instances, you can use an OpenResty-based implementation of SSL
tickets to dramatically speed up negotiation for reconnecting clients. You
will need a Redis server to store the rotating ticket keys, but that's easy
with AWS ElastiCache. You will also need to generate the random keys every so
often and store them in Redis, removing the oldest ones as you do. This is a
task I accomplished by writing a small Go service.
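
The rotation step itself is tiny; a hedged Python sketch of the core logic
(the Redis/ElastiCache plumbing and the nginx ticket-key wiring are left out):

```python
import os

def rotate_ticket_keys(keys: list[bytes], max_keys: int = 3) -> list[bytes]:
    # Prepend a fresh 48-byte session ticket key (nginx accepts 48- or
    # 80-byte key files) and drop the oldest so only `max_keys` remain.
    # In production the list would live in Redis so every nginx/OpenResty
    # instance encrypts and decrypts tickets with the same rotating keys.
    new_key = os.urandom(48)
    return [new_key] + keys[: max_keys - 1]
```

New connections get tickets under the newest key, while clients holding
tickets issued under the older keys can still resume until those keys age out.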

If you serve a latency-critical service, tickets are a must.

~~~
arkadiyt
> Didn't see it mentioned: SSL tickets

They do talk about it; SSL tickets and TLS session resumption refer to
the same thing.

------
rlastres
I guess this might be especially relevant for traffic patterns similar to the
one described in the article; for other use cases those optimisations most
likely won't translate into big savings.

------
devit
How about the obvious solution of not having ANY data transfer out?

Encrypt and sign the data via NaCl or similar, send it via UDP duplicated 5-10
times, with no response at all from the server (it's analytics; it doesn't
matter if a few events are lost, and you can even estimate the rate).

As for the REST API, deprecate it, and if it's still needed place it on
whichever 3 VPS services have the lowest costs, using low-TTL DNS round-robin
with something that removes hosts that are down from DNS.
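
A stdlib-only sketch of the fire-and-forget client side. HMAC only
authenticates; NaCl's secretbox would also encrypt. The key, address, and
duplication count here are made-up placeholders:

```python
import hashlib
import hmac
import socket

SHARED_KEY = b"demo-pre-shared-key"   # placeholder; baked into the client

def make_packet(event: bytes) -> bytes:
    # Prefix the event with a 32-byte HMAC-SHA256 tag so the server can
    # discard forged or corrupted datagrams.
    tag = hmac.new(SHARED_KEY, event, hashlib.sha256).digest()
    return tag + event

def fire_and_forget(event: bytes, addr=("127.0.0.1", 9999), copies: int = 5) -> None:
    # Send the same datagram several times; the server never replies,
    # so there is zero response egress to pay for.
    pkt = make_packet(event)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        for _ in range(copies):
            sock.sendto(pkt, addr)
```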

------
coleca
Fascinating article. I love posts with this type of in-depth investigation
into what everyone else would just pass over and not even think about.

It's not surprising that it's related to the gaming industry. Some of the best
AWS re:Invent videos I've seen are in the GAM (gaming) track. Even though I've
never worked in that field, the problems they get hit with and are solving
often are very relevant to any high-traffic site. Because of the extreme
volume and spikiness of gaming workloads, they tend to find a lot of edge
cases, gotchas, and what I'll call anti-best practices (situations where the
"best practice" turns out to be an anti-pattern for one reason or another,
typically cost).

------
ajbeach22
I wonder what the cost is compared to terminating SSL at CloudFront? For my
web tier architectures, I use CloudFront to reverse proxy both dynamic content
(from the API) and static content (from S3). SSL is terminated only at
CloudFront.

~~~
ball_biscuit
I don't think you can use CloudFront to serve that kind of traffic. CloudFront
costs are described here:
[https://aws.amazon.com/cloudfront/pricing/](https://aws.amazon.com/cloudfront/pricing/)

So for 10k HTTPS requests, the price is $0.01. If you serve 5 billion per
day, that is $5,000 a day. With such high traffic I believe you need to handle
it with performant webservers (Go, Erlang?) to keep costs reasonable, and
probably terminating SSL at the load balancer is the way to go.

~~~
ajbeach22
I am not sure that math is right. Using the AWS cost calculator, it's only
about $1,100/mo for 5B HTTPS requests. However, I think if you consider data
transfer it's still probably in the range of several thousand a day. Yikes.

~~~
Dunedan
Not sure what calculator you're using, but from the pricing page [1] it's
pretty clear that 5B HTTPS requests cost at least (depending on the geographic
origin) $5000. And that's per day and without data transfer.

[1]:
[https://aws.amazon.com/cloudfront/pricing/](https://aws.amazon.com/cloudfront/pricing/)

------
synunlimited
You could also look into using brotli compression over gzip for some more
savings of bytes over the wire.
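
Server-side this is mostly content negotiation on the `Accept-Encoding`
request header; a sketch of the selection logic (actual brotli encoding would
need the third-party `brotli` package, so only the choice is shown, and
q-values are ignored):

```python
def choose_encoding(accept_encoding: str) -> str:
    # Parse the Accept-Encoding request header and prefer brotli ("br")
    # when the client advertises it, falling back to gzip, then identity.
    offered = {token.strip().split(";")[0] for token in accept_encoding.split(",")}
    if "br" in offered:
        return "br"
    if "gzip" in offered:
        return "gzip"
    return "identity"
```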

~~~
Ayesh
Brotli support in API clients is quite low. I run a small API service, and
you'd be lucky to see API clients even using gzip.

~~~
synunlimited
True, though given it's relatively easy to support, you could get some savings
in the few cases that do use it.

Also, if they own some/all of the SDKs that are used for hitting their API
they could bake in brotli compression at that level.

------
meritt
This is an awesome article but if your egress costs are so high that you're
deciding which HTTP headers to exclude, you should probably be moving to an
unmetered bandwidth provider, or at least one that charges a reasonable amount
for egress.

~~~
caymanjim
Is there any such thing? I don't know of any cloud service provider that
offers unlimited bandwidth. There are very few providers who could handle five
billion connections per day in the first place, regardless of bandwidth.

~~~
meritt
5B requests/day is ~60k/second, that's big but nothing insane. There are
numerous frameworks/setups that can do _far_ more than that on a single
machine [1]

Popular unmetered options: he.net, OVH, Hetzner. You generally lose a lot of
the "cloud" capability with these options, however.

Cloud options: DigitalOcean egress is $0.01/GB ($0.005/GB if you buy it via
droplets), Linode is $0.02/GB, Vultr is $0.01/GB, etc.

[1]
[https://www.techempower.com/benchmarks/#section=data-r18&hw=...](https://www.techempower.com/benchmarks/#section=data-r18&hw=ph&test=db)

~~~
all_blue_chucks
Unmetered connections are only unmetered until you cost them more than you're
paying. Then they throttle you or boot you. Nothing is free.

~~~
meritt
I'm talking about actual unmetered where you pay for a dedicated amount of
bandwidth, e.g. 1 Gbps / 10 Gbps / 20 Gbps. 10 Gbps usually goes for about
$1k-$2k/mo in the US. This is how colo facilities have operated for decades.

10 Gbps fully saturated delivers about 3300TB for that $1-2k/mo, versus the
$22k/mo you'd pay AWS for the same.

I'm absolutely not talking about the "unlimited bandwidth" bullshit that
discount hosts offer.

~~~
all_blue_chucks
If your project gets featured on CNN and your bandwidth goes up 20x can these
colo arrangements automatically scale up your dedicated bandwidth? I ask
because having an outage when you get your first big break can cost you WAY
more than your bandwidth bill ever would...

------
tyingq
Maybe also consider caching API responses in a cheaper non-AWS CDN where
possible. APIs like "zip code to list of cities" where the output is the same
for all users, and doesn't change often.

------
nimish
Switch to ECDSA certs and shave off another few hundred bytes :)

Bandwidth is the killer thing with AWS. It's designed to make you move
services inside the boundary.

------
pragnesh
"accept-encoding: gzip" is a request header. Why was it present in the
unoptimized response in the first place?

~~~
mhenoch
It was added as part of a bug fix five years ago: the server was looking at
the Content-Type request header instead of Content-Encoding to determine
whether the incoming payload was compressed. Not sure why the Accept-Encoding
response header was added at the same time, but it went undetected since it
didn't cause any problems (apart from costing money).
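
The fix boils down to checking the right header; a minimal sketch (the
function name and dict shape are assumptions, not the article's actual code):

```python
import gzip

def read_body(headers: dict[str, str], raw: bytes) -> bytes:
    # Compression of an incoming payload is signalled by Content-Encoding;
    # Content-Type only describes the media type (the original bug checked
    # the wrong one).
    if headers.get("Content-Encoding", "").lower() == "gzip":
        return gzip.decompress(raw)
    return raw
```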

------
nartz
How about a UDP endpoint?

------
pragnesh
Using protobuf or flatbuffers also reduces payload size.

~~~
rlastres
Protobufs are an option that could work for SDKs, but the API is also a public
documented REST one:
[https://gameanalytics.com/docs/item/rest-api-doc](https://gameanalytics.com/docs/item/rest-api-doc).
Also, in the responses it could be possible to just not include a body, and
AWS does not charge for data transfer in, so the size of the request JSON is
not relevant for the cost.
------
kaos19870
wow, such a small change to your HTTP Headers can save you that much?

~~~
blantonl
If you are running 5 billion daily requests where your outgoing response size
is significantly less than the aggregate size of the headers, then yes.

Also, the article clearly articulates that the answer is yes.

------
bullen
"If the clients use HTTP/2, data transfer decreases further, as response
headers are compressed."

But CPU usage increases for decompression, and CPU is the only real
bottleneck.

Just because you don't pay for the compression electricity doesn't mean you
get away with it.

This ties back to my previous comment on the User-Agent subject yesterday:
removing all headers except "Host" from all HTTP traffic is the solution.

HTTPS is a complete waste of energy. Security should not be overarching; it
should be applied with precision.

WebSockets are also bad, since they don't work well with memory latency. Use
"Transfer-Encoding: chunked" on a separate pull connection instead.

~~~
toast0
Electricity use for the client on compressed vs not compressed isn't as clear
cut as more/less CPU. You also need to consider the reduction in use of the
network interface, since the data size will be smaller. Overall latency could
improve as well if the compressed form is meaningfully smaller (depending on
the TCP congestion window, just one packet smaller can save a whole
round-trip time).

~~~
bullen
No, that's not how it works: you cannot upgrade the routers in real time
without complexity and additional cost, so the cost of transfer is fixed, with
more latency. But if you subtract bad protocol design and the latency added by
compression/decompression, I'm pretty sure you end up with the same deal, just
with more complexity that costs even if you don't see the costs.

Just like wind power actually competes with nuclear because it takes 30 days
to wind down a nuclear power plant.

Also data can be compressed with more efficient hardware on the backbone
without you having to deal with it.

The biggest cost of the internet is idle things and synchronized CPUs; async
never made it, unfortunately.

