TLS Termination for Network Load Balancers (amazon.com)
117 points by el_duderino 26 days ago | hide | past | web | favorite | 60 comments

They don't mention it in the article, but TLS comes at an additional cost. Same price per hour and per LCU, but an LCU gets you less with TLS. Standard LCU:

- 800 new non-TLS connections or flows per second.

- 100,000 active non-TLS connections or flows (sampled per minute).

- 1 GB per hour for EC2 instances, containers and IP addresses as targets.


- 50 new TLS connections or flows per second.

- 3,000 active TLS connections or flows (sampled per minute).

- 1 GB per hour for EC2 instances, containers and IP addresses as targets.


Just because I found it interesting:

- 16x cost for new connections or flows per second.

- 32x cost for active connections or flows.

- 1x cost for traffic carried in GB per hour, which you'd expect.

That's still double the connections compared to having an ALB which only gives you 25/s/LCU. Plus an NLB is 0.6 cents/hour/LCU in us-east-1 vs 0.8 for an ALB.
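For concreteness, here's a quick Python sketch of the LCU math: consumption is the maximum across the three dimensions, using the figures quoted above (verify against current AWS pricing before relying on them):

```python
# LCU dimensions as quoted in this thread (may be out of date).
NON_TLS = {"new_per_s": 800, "active": 100_000, "gb_per_hr": 1}
TLS = {"new_per_s": 50, "active": 3_000, "gb_per_hr": 1}

def lcus(workload, dims):
    # LCUs consumed = max over the three dimensions.
    return max(workload["new_per_s"] / dims["new_per_s"],
               workload["active"] / dims["active"],
               workload["gb_per_hr"] / dims["gb_per_hr"])

workload = {"new_per_s": 100, "active": 12_000, "gb_per_hr": 1}
print(lcus(workload, NON_TLS))  # 1.0 -- traffic-bound without TLS
print(lcus(workload, TLS))      # 4.0 -- active-connection-bound with TLS
```

Same workload, 4x the LCUs once TLS is on, which matches the 16x/32x dimension ratios above.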

There is a great mini-thread by a principal engineer at AWS describing why this is so amazing: https://twitter.com/colmmacc/status/1088510453767000064

I was pretty excited until I saw the screenshot where you have to choose the certificate.

If you run a multi-tenant system that services N "vanity" domains (where N is around the same number as your users on a $5/month plan), there is still no service in AWS to do transparent TLS termination for a reasonable cost. Which is a pity, since it really costs almost nothing to generate these certificates.

Completely agree. I have the same issue with Google Cloud and DigitalOcean. This is a real need for SaaS providers. A wildcard customer-name.my-saas.com is not enough. Many customers want something like service-name.customer-name.com.

We use wildcard certs (*.example.com) so each customer can have vanity-customer-name.example.com domains. I think our model is fairly common for multi-tenant domain name segregated systems.

We do that as well for the cheaper packages. But serious users insist on an our-service.client.com "vanity" scheme, which is easy enough with CNAME records; it's the TLS that adds significant cost.

For example, Application Load Balancers (ALBs) come with a limit of max 25 certificates, which is a non-starter for us. So you cannot avoid terminating your TLS in nginx/caddy etc. (it also needs to be HA, of course) and then hitting another LB that stands in front of the actual service. You end up with a 2-layer LB architecture that adds cost and complexity.

Can you bundle your certificates the way Cloudflare does? IIRC if you connect to a CF endpoint you will get a certificate with N arbitrary hostnames that are being served by that endpoint. I don't know what the SNI hostname limit is, but it's probably stupidly high. Multiply that by 25 and it may be tenable?

You really want to avoid putting too many SANs on a certificate as i) renewal breaks much more frequently; ii) certificate size increases, negatively affecting performance due to fragmentation; and iii) browsers may lock up (if you like to live dangerously, try loading https://10000-sans.badssl.com/ for example).

The biggest operational headache by far is on renewals. If one customer on a commingled certificate adds a CAA record that doesn't include your CA of choice (or more commonly, the customer churns and no longer points to you), your renewal fails. If you're running a SaaS business you really want one certificate per hostname that's lazy loaded based on the incoming SNI. This keeps renewal failures and support costs down, and keeps customers from seeing a list of their competitors on their certificate.
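A minimal sketch of that lazy, SNI-keyed loading using Python's `ssl` module (the on-disk cert layout and paths are made up for illustration):

```python
import ssl

# Cache of per-hostname server contexts, filled lazily on first handshake.
_contexts: dict = {}

def _load_context(hostname: str) -> ssl.SSLContext:
    # Hypothetical layout: one fullchain/privkey pair per customer hostname.
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    ctx.load_cert_chain(f"/etc/certs/{hostname}/fullchain.pem",
                        f"/etc/certs/{hostname}/privkey.pem")
    return ctx

def context_for(hostname: str, loader=_load_context) -> ssl.SSLContext:
    # Lazy-load and cache: a cert is only read (or issued) on first use.
    if hostname not in _contexts:
        _contexts[hostname] = loader(hostname)
    return _contexts[hostname]

def sni_callback(sslsocket, server_name, initial_context):
    # Runs during the ClientHello; swap in the per-customer context.
    if server_name:
        sslsocket.context = context_for(server_name)

base = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
base.sni_callback = sni_callback
```

The same pattern exists in most stacks (e.g. a `GetCertificate` hook), and pairs naturally with on-demand issuance: if the cert isn't on disk yet, the loader can request one.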

As @elithar points out, that's why we built SSL for SaaS. You make a single API call with the hostname you want a certificate issued for, indicate the validation method (HTTP /.well-known is by far the "happiest path"), and then in about a minute you've got a certificate deployed worldwide. You tell us where to route traffic back to by providing a default origin (which can be a load balancer) and optionally overriding it on a per-hostname basis.

So long as your customer is CNAME'ing to your domain and that domain resolves to Cloudflare's edge, we can automatically complete—and keep completing—domain control validation (DCV). We then issue two certificates per hostname: one P-256 keyed, SHA-2/ECDSA signed certificate that gets presented to modern browsers[1] and one RSA 2048-bit, SHA-2/RSA signed certificate that gets presented to browsers that don't support ECC.

1 - https://blog.cloudflare.com/tls-certificate-optimization-tec...

Issuing SANs with multiple tenants on the same certificate, even if the substantial technical problems were overcome, would make our clients reach for pitchforks and torches.

> You make a single API call with the name of the hostname you want a certificate issued for ... [magic TLS things happen]

This, precisely, is how the ALB _should_ work, without a silly 25-certificate limitation. Sounds like an excellent service.

I wonder what the cost of issuance for a TLS certificate is. There is minimal storage/network load and some computation (likely in hardware). Perhaps the main cost is collecting sufficient entropy?

Cloudflare has a service that does this automatically - they call it "SSL for SaaS" - and you can provision custom/vanity hostnames onto their own certificates within a minute or so + push to the edge.

You should reach out to https://twitter.com/prdonahue (PM), who can share the details.

(Used to work there, now at GCP)

Use multi-SAN certificates. Yes, it means revealing who is using your service, but that's for you and your customers to weigh against the costs.

This is what I do, but that's a lot of accidental complexity.

Here is what Caddy's author wrote about this:

> I still don't like the idea of SAN certificates. Too much room for error... what if you go to renew a SAN and one of the domains fail, the other 99 don't get renewed either. (Sure, we can code in logic to make a different certificate with 99 names, but that gets complicated quickly.) Also, we'd need a database as we have a many-to-many relationship rather than 1:1 which is much easier.

Source: https://github.com/mholt/caddy/issues/831#issuecomment-22011...
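The fallback logic he alludes to (retry with the failing name dropped) might look like this sketch, where `renew` stands in for a real ACME client call:

```python
def renew_with_fallback(names, renew):
    """Try renewing one SAN cert covering all names; if a single name
    fails validation, drop it and retry with the rest -- exactly the
    complexity the quote warns about. Assumes renew() raises ValueError
    carrying the offending name (a made-up convention for this sketch)."""
    remaining = list(names)
    while remaining:
        try:
            return renew(remaining), remaining
        except ValueError as failed:
            remaining.remove(str(failed))
    return None, []
```

Even this toy version hides real problems: partial coverage, bookkeeping of which cert serves which name (the many-to-many relationship he mentions), and churned customers silently shrinking the cert.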

If your domains are sub.domain.com, sub2.domain.com, you could use a wildcard *.domain.com certificate.

The problem with this is that you have to complete domain control validation at the apex of the domain, not just on each hostname.

If you do per hostname, you can ask a CA to send a validation request to http://sub.domain.com/.well-known/pki-validation/... and by nature of that hostname CNAME'ing to your domain (which they'll need to do anyway), you can complete it.

If you do at the apex, you'll need to complete DCV out of band, e.g., by having an email sent or adding a separate DNS record above/beyond what's being used to point to you.
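To illustrate the per-hostname flow: because the customer CNAMEs to you, the CA's HTTP validation request for sub.domain.com lands on your servers and you can answer it yourself. A minimal sketch (the token value and exact path are illustrative, not a specific CA's):

```python
from http.server import BaseHTTPRequestHandler

# Tokens the CA asked us to serve, keyed by request path (illustrative).
TOKENS = {"/.well-known/pki-validation/token.txt": "ca-issued-token"}

class DCVHandler(BaseHTTPRequestHandler):
    """Answers domain control validation probes on the customer's behalf."""
    def do_GET(self):
        body = TOKENS.get(self.path)
        if body is None:
            self.send_error(404)
            return
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.end_headers()
        self.wfile.write(body.encode())
```

Apex validation can't be automated this way precisely because the apex typically doesn't point at you, hence the out-of-band email or DNS record.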

> This will free your backend servers from the compute-intensive work of encrypting and decrypting all of your traffic


That’s great news.

Does anybody know if they support ALPN to announce h2 as the HTTP/2 protocol?

They don’t do this for ELBs, I would hope for NLBs.

If they do, this would finally be the sane way to use NLBs + ACM to expose gRPC services at the edge.

Another solution might be (as is popping up elsewhere in the thread):

Use NLB + ACM, let it do MITM, and then forward the traffic to a service with a self-signed certificate. The hope here would be that the NLB picks up the ALPN protocol from your backend service and communicates it transparently to the outside world. Don't know if this architecture makes sense.

Anyways, I just would love gRPC capable secure workloads on the NLB with ACM.

Doing self-signed certificates or Let's Encrypt works already; the game-changer would be ACM + ALPN, which also gives you upstream HTTP/2 traffic behind the NLB (compared to the ALB, which only supports upstream HTTP/1).

Unfortunately, this still wouldn’t have solved my biggest pain point at the last company I worked for. You can’t use TLS termination at the load balancer when you require HIPAA Compliance.

If they had some kind of way to use ACM-generated SSL certs on the VM, with a method for auto-renewal, that would be ideal.

Encryption is an "addressable implementation" in the final Security Rule [1]. Practically this means you are not required to encrypt your data (e.g., between LB and VM) "if the entity decides that the addressable implementation specification is not reasonable and appropriate [...]".


It also looks like it supports encrypted traffic between the LB and the instance, as shown in the guide.

AWS LB does not validate backend certificates, so you can put a self-signed cert on the instance. Heck, even if the cert expires it will still work, and make LB<->EC2 connection technically encrypted. Yay compliance.

You can actually turn this on (for classic ELBs), but it locks to a specific cert (rather than a CA). So yeah no one cares about expiry but the backend does have to present that cert. The thing to look for is "Enable backend authentication".

I would question whether this is a problem though; basically if someone is in a position to MITM traffic in your AWS VPC this would indicate a compromise of AWS at a fundamental level (or loss of your AWS control plane).

AWS does not encrypt internal traffic, including traffic between Availability Zones. AZs are spread over several datacenters, so your VPC traffic (RDS/microservices etc.) travels unencrypted across multiple physical locations. I consider AWS's network assurances sufficient, so it's not a problem for our standard threat model, but the auditors got their checkboxes to tick...

Do you have a citation that data travels between data centers in AWS unencrypted? At least between regions, it is encrypted: https://aws.amazon.com/blogs/aws/new-almost-inter-region-vpc...

In case this sounds bad for security reasons (using self-signed certs) keep in mind that the VPC network does not allow clients to spoof IP addresses or receive traffic destined for any other MAC/IP pair. So, there is no need to validate the authenticity of a host on the network because it has already been validated by the VPC software defined network, and spoofing MAC/IP, or ARP poisoning, or any of the other traditional physical network layer attacks just don't work.

However, it is required by AWS when you sign their BAA to have HIPAA compliance.

The approach I looked at to solving this problem was using an Envoy proxy sidecar.

SSL added and removed here :)

> After choosing the certificate and the policy, I click Next:Configure Routing. I can choose the communication protocol (TCP or TLS) that will be used between my NLB and my targets. If I choose TLS, communication is encrypted; this allows you to make use of complete end-to-end encryption in transit

Wait. That is not what "end to end" means. This is more like "piecewise end-to-end", and it is not the same thing as "end to end".

Now, people are going to say that "it doesn't matter since the only party you're disclosing the comms to is AWS, who you already implicitly depend on because you're running all your stuff under their hypervision". That's true. (ofc now there are 2 places in AWS's infra where the comms are in plaintext, so now an adversary has two teams of engineers/opsen to social engineer / compromise, and will win if they break either one).

I'm reacting to the dilution of the term "end to end" here, because it's a vital concept.

This further reinforces the pattern of terminating TLS at the LB. While this is generally justifiable, it does decrease defense-in-depth.

> After choosing the certificate and the policy, I click Next:Configure Routing. I can choose the communication protocol (TCP or TLS) that will be used between my NLB and my targets. If I choose TLS, communication is encrypted; this allows you to make use of complete end-to-end encryption in transit:

You can still communicate over TLS internally. What do you mean by it decreasing the defence?

But then what's the purpose of terminating at the LB? The two advantages are that it takes the decryption load off your web server and that you get auto-renewal of SSL certs. Once you have to manage your own certs and decrypt at your server, both of those advantages are gone.

Having the public facing certificate managed by acm and not worrying about deploying those on the instances is a win. I'm not sure how the certificate validation is done LB <-> instance, but if you can have a private CA and use that for intra-communication it would still keep everything encrypted properly

Auto-generated self-signed certs on the servers.

What’s the threat model for someone intercepting traffic between a load balancer and an EC2 instance?

Please see this tweet from an AWS principal engineer:


"... well, NLB runs on Amazon VPC. On VPC we encapsulate, authenticate and secure traffic at the packet level. Packets can't be spoofed or MITMd on VPC. Traffic only goes where you send it. That makes it possible to use a self-signed, or even expired, certificate on your end."

That is generally AWS's stance on the matter too; the VPC is considered secure enough. I remember reading something about it on their own blog, but now I could only find this, where it is explained:


I’ve watched the reinvent videos where they describe the custom hardware NICs they use with their own custom ARM chips to ensure security and that traffic isn’t spoofed.

I'd like to know too. This seems like a case of being on the other side of the airtight hatchway.


What's the threat vector if it's not a cloud environment? Also if you have access to either the LB or host wouldn't you get access to the certificate there anyway to decrypt the traffic?

I agree. Realistically, it's just as safe to do SSL termination at the LB. However, it's generally interpreted that to meet certain security compliance regimes like HIPAA you have to have end-to-end encryption and encryption at rest.

Someone posted a quote from the HIPAA regulations suggesting that may not be the case. I don't think I would risk taking that chance from a compliance standpoint, though.

So AWS didn't have TLS termination until now?

I'm still waiting for GCP to support more than ten (!) certs per load balancer, which seems like a ridiculous limit when you consider the need to serve dozens or hundreds of low-traffic customer domains from a single endpoint. If you're on Kubernetes, they're basically asking you to split your ingresses up just to avoid the limit.

We ended up going straight to running our own in-cluster load balancer (Traefik, which is meh, but works okay) with Let's Encrypt so we don't need to provision anything at all. It's so much nicer than fiddling with manual cert registration. I really wish cloud providers such as Google would get on board with Let's Encrypt already.

They have supported TLS termination for HTTPS connections for a long time. This adds support for TLS on TCP connections. The NLB also allows you to keep the same set of public IP addresses for your load balancer endpoints.

We use cert-manager to get Let's Encrypt certs in GKE, and automatically push them to Google's global load balancers.


Doesn't that still require that you create a maximum of 10 host rules per ingress? That makes it harder to automate things.

With the current system, we can use one consistent IP or CNAME for all new domains. With the 10-certs-per-GLB limitation, we'd have to manage the DNS accordingly so that all the domains are correctly spread out among the N different ingress GLBs.

At the time we started with Kubernetes, cert-manager was listed as alpha quality and not recommended for production. Even today, it seems Gandi (which we use for DNS) still isn't supported; at least it's not in the documented list [1]. LEGO supports Gandi, so I'm not sure if maybe the documentation is wrong here.

[1] https://docs.cert-manager.io/en/latest/reference/issuers/acm...

I don't think this limitation applies when using Kubernetes with cert-manager. When you create Kubernetes Service API objects of type LoadBalancer, you get (by default) a TCP load balancer on GCP.

SSL termination becomes the responsibility of the cluster. The certificate (and private key) is stored inside the cluster, too, via secrets. To have cert-manager automatically create and renew certificates, all you have to do is update your ingress host/tls YAML configuration.

Using cert-manager, you can continue exposing your app under many domain names with a single IP (via an A record) or abstractly through a CNAME record.

I still haven't added more than 10 domains, though, so I can't verify your concern directly. But Kubernetes isn't making any changes to my load balancer as I've added domains. I would revisit whether cert-manager is right for your use case: I noticed no mention of it being alpha quality in the README. However, they do point out it is 0.x and there could be breaking changes to the API later.

So the method you describe ends up using a TCP GLB, but the point of using an HTTP GLB is to enjoy all the benefits of the GLB.

With an HTTP GLB, you get a very cheap, distributed, effectively global CDN that doesn't require special configuration. You also get features like health checking, logging (which you can pipe to BigQuery), Google's "just works out of the box" edge caching (negating the need for an external caching CDN like Fastly), and so on.

My understanding is that the typical use case is to run cert-manager together with ingresses, where cert-manager will allocate certs and create Kubernetes secrets for each. If you wire up ingresses with those secrets, and you use the "gce" or default ingress class, then you end up getting TLS termination at the GLB level, very easily.

However, because of the 10-certs-per-LB, you run into the problem I described before. There's no way around it except to create ingresses that contain a maximum of 10 hosts. (It's a hard limit, not a quota; you can't petition to have the limit increased.) If you have 60 domains, that necessarily means 6 GLBs, and 6 different IPs to point DNS to.

We're using Traefik today as a custom ingress controller with a TCP GLB. So we get one external IP, and Traefik handles TLS termination (via Let's Encrypt). So this is effectively the same as running cert-manager plus something like the Nginx ingress controller.

Yeah, I should have clarified in my case I am using the nginx ingress controller, but Traefik would also suffice.

They supported it on ELBs and ALBs, but not NLBs.

I'm not sure about GCP, but on AWS you can have domain-validated ACM certs, which are arguably more convenient than LE certs.

AWS will transparently renew them for you if they are associated with a load balancer or CloudFront. No moving parts required in your infra.

Dumb question, but has the industry decided on what they want to call TLS/HTTPS? Majority of people I talk to still refer to them as SSL certs...

Old habits die hard.

The current standard is TLS, which superseded SSL back in 1999. Uptake and awareness were initially slow, and SSL itself wasn't deprecated until 2015. It doesn't help that a bunch of projects (OpenSSL being a prominent one) still have "SSL" in their name.

Technically, they're not "SSL certs". They're X.509 certs. Certificates show up in many places unrelated to TLS/SSL.

Well, the things people want aren't (generally) just X.509 certificates.

If you expect your certificates to work in common client software like web browsers or some random Python client code a third party is writing then you're going to want:

* PKIX, currently RFC 5280 plus revisions, the Internet's chosen standard for how X.509 should be implemented. PKIX says a bunch of things about what you should or should not write in the X.509 certificate, you will probably have better luck conforming as much as possible even if you don't care about:

* The Web PKI. A Public Key Infrastructure has Certificate Authorities (plenty of those across all of X.509 or you can roll your own) but they're trusted by Relying Parties (parties who are _relying_ on the certificate's attestation to be true) and you probably want certificates that will be trusted by most RPs. Grandma doesn't know what the X.500 directory system is, or who a Certificate Authority is, but she does use Safari, and Safari trusts certain CAs on her behalf as does macOS. So you're going to want certs from one of those CAs. The Web PKI is a loose name for (despite that word "Web") the PKI covering SSL/TLS services on the Internet.

* You may want to get more specific. Although the Baseline Requirements which say roughly how a Certificate Authority should work are agreed across industry, each of the Major Trust Stores has their own additional rules. You probably care about all of them. They roughly correspond to operating system vendors. Microsoft and Apple (for their browsers and operating systems), then Mozilla (for Firefox on all platforms, and for Free Unix systems), Google (mostly just for Android, not Chrome), and then runners up Oracle (for Java), then a long tail of people including Nintendo and all those crappy in-car entertainment systems people...

For example, probably fifty people in the whole world have ever bothered trying to use the Web from a Nintendo WiiU (not to be confused with the popular Wii) console. If they try now though lots of things don't work. Because Let's Encrypt isn't trusted by the WiiU, and since it's an end-of-life product, probably never will be.

But "SSL Certs" gets across what you mean pretty well, only pedants are going to insist it's wrong, and unless you're currently playing "Um, Actually" they should cut it out.

For the most part, it's called SSL. Hell, it's even more taxonomically coherent, take it from the bear:

> So basically, the “SSL vs TLS” taxonomy represents the ownership change in 1999, but not the technical details of the protocols. From a purely technical point of view, lumping together SSL-3.0 with SSL-2.0 but separating it from the TLS versions is about as wrong as, in biology, classifying whales as a kind of fish (despite what Herman Melville wrote about them).

> Therefore, I tend to use “SSL” (or “SSL/TLS”) to designate the whole conceptual family of protocols, from SSL-1.0 to TLS-1.2. This is the main reason why BearSSL is called BearSSL and not BearTLS.

> A secondary reason is more about marketing. People who know what TLS is also know what SSL is, but not the other way round. For instance, Web site owners look for a “SSL certificate”, not a “TLS certificate”. By using the “SSL” acronym, I thus reach a wider audience, even if it is at the cost of irritating some of the more pickish taxonomists (and I am totally ready to out-pedant them if need be, as demonstrated above).


PolarSSL took the plunge and renamed. It isn't impossible.

TLS is our octopi.

My customers won't stop calling them SLL Certificates. Drives me mad.

It is TLS... SSL is just much better known and will likely stick around until written references to SSL go away, which will be a while :)

SSL references "sockets", which is still where TLS is usually implemented, but TLS is conceptually about transport security independent of the implementation (i.e., of sockets). For instance, consider this TLS-inspired solution (https://www.cipheredtrust.com/doc/); it has nothing to do with sockets.

How do they ensure the target machines see the original IP address and port?

They encapsulate the packet, including data about the original IP address, as it is sent to the hypervisor on the EC2 instance. The hypervisor then reverses the process and creates a packet with the original IP address as the source when it is forwarded into the VM.
