
How CDNs Generate Certificates - ordiblah
https://fly.io/blog/how-cdns-generate-certificates/
======
tialaramex
It would be interesting to see stats from the CAs about which of the Blessed
Methods is most popular. (This article is about Let's Encrypt using
tls-alpn-01, which is an implementation of 3.2.2.4.10 "TLS Using a Random Number".)
Doubtless Fly aren't the only people doing tls-alpn-01 in bulk but we don't
have a good overview as far as I'm aware.
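
For readers unfamiliar with tls-alpn-01, here is a minimal sketch of the value the challenge revolves around. In RFC 8737 the server answers a handshake that negotiates the ALPN protocol "acme-tls/1" with a self-signed certificate whose acmeIdentifier extension carries the SHA-256 digest of the ACME key authorization. The token and account key below are made-up placeholders, and the function names are hypothetical; this is not Fly's or Let's Encrypt's code.

```python
import base64
import hashlib


def key_authorization(token: str, jwk_thumbprint: bytes) -> str:
    """Join the challenge token and the base64url-encoded account key thumbprint."""
    thumb = base64.urlsafe_b64encode(jwk_thumbprint).rstrip(b"=").decode("ascii")
    return f"{token}.{thumb}"


def acme_identifier_digest(token: str, jwk_thumbprint: bytes) -> bytes:
    """SHA-256 digest placed in the challenge certificate's acmeIdentifier extension."""
    key_auth = key_authorization(token, jwk_thumbprint)
    return hashlib.sha256(key_auth.encode("ascii")).digest()


# Placeholder inputs for illustration only (assumptions, not real ACME data).
token = "example-token"
thumbprint = hashlib.sha256(b"example account key").digest()
digest = acme_identifier_digest(token, thumbprint)
print(digest.hex())
```

Building the certificate itself and serving it only to "acme-tls/1" handshakes is left out here, since that part depends on the TLS library in use.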

In principle they can all generate those statistics because they (are supposed
to) log enough information to identify what went wrong when, inevitably,
something is misissued. Logically that also includes at least which method was
used to verify domain authorization or control.

One of the things that went wrong at Symantec was that some of the records
turned out to be notionally kept at CrossCert, a separate Korean company.
CrossCert simply did not keep any records (or if it did, they were in such
disarray that refusing to disclose them seemed less likely to attract
retribution) and Symantec had seemingly never checked.

Knowing which methods are popular with Subscribers, and whether that varies
considerably between CAs would be valuable in trying to figure out how more of
the worst Blessed Methods can be deprecated or improved, and who we need to be
talking to about that.

For example, if Let's Encrypt is doing almost all the 3.2.2.4.19 ("Agreed
Upon Change to Website - ACME") then there's no point ragging on other CAs for
the shortcomings of relying on plaintext HTTP in this method. Or maybe
DigiCert are doing a lot of 3.2.2.4.15 ("Phone Contact with Domain Contact"),
in which case they are the people to talk to about any proposed improvements
around stuff like leaving a voicemail.

------
tptacek
Part of the last few weeks involved me learning Rust and using it in anger (if
hooking nfqueue up to tokio counts as "in anger") so if you'd like to irritate
the hell out of 'pcwalton, feel free to ask me Rust questions.

~~~
dochtman
Exciting! Are you doing this in your role as Latacora helping out startups
with security challenges? (Update: apparently not
[https://twitter.com/tqbf/status/1276212163582070785](https://twitter.com/tqbf/status/1276212163582070785))

How is the Fly proxy implemented? Are you using rustls and/or any of the
available ACME crates?

I've been wanting to implement tls-alpn-01 support for rustls (although it
might be possible to do this just by mutating the ServerConfig over time).

Also interested to hear your general impressions of Rust so far (I think I
read some Twitter grumbling...).

~~~
tptacek
I'm full-time at Fly. I'll let Jerome answer the fly-proxy question, since
it's his code and I wouldn't want to inadvertently take credit.

I think I came across as grumbling about Rust when my real perspective was
much more subtle. My take on Rust so far is that it has been, for me, a
vindication of a lot of decisions the Go team made, because I've been directly
exposed to some of the downsides of the opposite decisions. But, while that
sounds like a critique of Rust, it's not! Rust is the way it is for real
reasons: zero-cost abstractions and no runtime GC, which are, right now,
requirements for some application domains.

For me, right now, writing in Rust feels almost identical to how writing in
C++ felt 15 years ago. But I'll keep writing in it, and it'll get faster for
me. We're a Rust-on-the-data-plane shop!

~~~
JoshTriplett
If you run into issues in Rust that you believe might be signs of a need for
language improvements, please feel free to raise them. I'm happy to help.

------
ancarda
Is anyone else feeling quite sad reading this article? ALPN being used because
only 80/443 are realistic these days, middleboxes causing the TLS handshake to
have padding so it's not misinterpreted as an ancient protocol (SSLv2).

It feels like the Internet is so fragile.

~~~
SahAssar
Most of this could have been avoided by using DOH and SRV records for
HTTP/HTTPS. I still don't understand why SRV records are not supported for
HTTP/HTTPS in browsers.

~~~
ancarda
I remember looking into why A/AAAA is still used over SRV, and it would seem
performance is one of the big concerns; browsers do not want to make more DNS
lookups than necessary.

I think they'd end up with 4 lookups; A, AAAA, SRV (_http2._tls), and SRV
(_http._tls).

Though perhaps you are suggesting DoH could mean the resolver also returns SRV
records if you request A or AAAA? i.e. proactively point out there's an HTTP
server?

~~~
jiggawatts
This debate comes up a lot, and it's hilarious how misguided it was.

I regularly work with load-balancers such as Citrix ADC (NetScaler) or F5
BIG-IP. These do DNS-based load-balancing, dynamically returning "A" records
to the browsers so that they can get the "single working IP address" they're
expecting. The browsers don't try very hard to fail over to secondary IPs
because this is the established standard architecture, but they don't need to
because of this common setup.

Sounds like an optimal solution, right? It does at first glance anyway, as
long as you ignore the eye-watering price tag on those load balancer boxes.

The subtle but critical issue is that by returning "A" records, the load
balancers have to use a short time to live (TTL)! This is because there's a
trade-off: You can have fast failover, OR long-lived DNS caching. _With A
records you can't have both!_

Typical response TTL times are 5-30 seconds, 5 minutes tops if you hate your
users. This means that many browsers will be forced to repeatedly re-query the
DNS servers on _every page load_ for typical end-user workflows. It also means
that for all but the biggest, most popular sites, the ISP DNS cache does
practically nothing for these records.

Meanwhile with SRV records the TTL times can be much higher, hours even. This
is how Active Directory works, for example, all of the Domain Controllers add
themselves to various SRV records so that if you query
"_ldap._tcp.dc._msdcs.test.com" you get back all the DCs. These records
include priorities and weightings, so you can pull tricks like incrementally
demote a DC or prioritise the shiny new one.
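
To make the priority and weighting idea concrete, here is a rough sketch of how a client might order SRV answers: lowest priority value first, then a weighted random pick within each priority group. It is a simplification of the selection procedure in RFC 2782, not a description of what Active Directory clients actually do, and the host names and numbers are invented for illustration.

```python
import random
from typing import NamedTuple


class SrvRecord(NamedTuple):
    priority: int
    weight: int
    port: int
    target: str


def order_srv(records, rng=random):
    """Return the records in the order a client would try them."""
    ordered = []
    for priority in sorted({r.priority for r in records}):
        group = [r for r in records if r.priority == priority]
        while group:
            weights = [r.weight for r in group]
            # If every remaining weight is zero, fall back to a uniform choice.
            pick = rng.choices(group, weights=weights if any(weights) else None, k=1)[0]
            group.remove(pick)
            ordered.append(pick)
    return ordered


# Hypothetical records: two preferred DCs and one that is being demoted.
records = [
    SrvRecord(priority=20, weight=0, port=389, target="old-dc.example.test"),
    SrvRecord(priority=10, weight=90, port=389, target="new-dc.example.test"),
    SrvRecord(priority=10, weight=10, port=389, target="dc2.example.test"),
]
ordered = order_srv(records, rng=random.Random(0))
for rec in ordered:
    print(rec.priority, rec.weight, rec.target)
```

Raising a record's priority number is what "demoting" amounts to in this sketch: it is only tried after every record with a lower number.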

If you watch the AD connection traffic in Wireshark, it's incredible. It very
quickly steps through alternate services and then reorders the successful hits
in front of the failures so that subsequent queries are lightning fast. It is
astonishingly tolerant of partial networking failures, yet still fast to
connect despite that!

The key mistake made by the original DNS design working groups was that SRV
records should have returned a list of IP addresses instead of a list of host
names.

~~~
diroussel
Yes, there does seem to be something missing in the semantics of IP lookup. A
user agent ought to be able to look up an IP with "keep using until it stops
working" semantics. Or "choose one of these 3 IPs, and keep using the one you
chose until there is a problem". Or something more similar to how we want
client-side site availability logic to work.
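
One way to picture the "keep using until it stops working" semantics is a small client-side picker that sticks to one address and only moves on when a failure is reported. This is a hypothetical sketch (the class name and the documentation-range addresses are made up), not something any browser or resolver is known to implement.

```python
class StickyAddressPicker:
    """Keep returning the same address until the caller reports a failure."""

    def __init__(self, addresses):
        if not addresses:
            raise ValueError("at least one address is required")
        self._addresses = list(addresses)
        self._index = 0

    def current(self) -> str:
        """Address to use for the next connection."""
        return self._addresses[self._index]

    def report_failure(self) -> str:
        """Advance to the next address (wrapping around) and return it."""
        self._index = (self._index + 1) % len(self._addresses)
        return self.current()


picker = StickyAddressPicker(["192.0.2.1", "192.0.2.2", "192.0.2.3"])
print(picker.current())         # stays on the first address while it works
print(picker.report_failure())  # moves to the second address after a failure
```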

------
Karupan
> We proxy traffic from edge servers to containers through a global WireGuard
> mesh.

I am more interested in the mesh. Do you have more details on that?
Specifically why this architecture was chosen, what kind of latency does
WireGuard add, etc.

~~~
mrkurt
Ooooh I love Wireguard, we'll have an article about this in the next couple of
months.

We picked it because it's really simple to manage, and we wanted to ensure
traffic between datacenters was always encrypted. We have a little tool called
"flywire" that keeps wireguard peer configs updated from Consul. Once we
accept a connection from a user, we pick a target VM, and then connect them
over the wireguard mesh.
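
As a rough illustration of what keeping peer configs updated can involve, the sketch below renders `[Peer]` sections in the standard WireGuard config file format from a list of peer descriptions. It is not flywire: the dict shape, the function names, and the placeholder keys are all assumptions, and fetching the peer list from Consul is left out entirely.

```python
def render_peer(peer: dict) -> str:
    """Render one [Peer] section in WireGuard config file syntax."""
    lines = [
        "[Peer]",
        f"PublicKey = {peer['public_key']}",
        f"AllowedIPs = {', '.join(peer['allowed_ips'])}",
    ]
    if peer.get("endpoint"):
        lines.append(f"Endpoint = {peer['endpoint']}")
    return "\n".join(lines)


def render_peers(peers) -> str:
    """Render all peers in a stable order so unchanged input gives unchanged output."""
    ordered = sorted(peers, key=lambda p: p["public_key"])
    return "\n\n".join(render_peer(p) for p in ordered) + "\n"


# Hypothetical peer data; a real tool would load this from its service catalog.
peers = [
    {"public_key": "PLACEHOLDER_KEY_B", "allowed_ips": ["10.0.2.0/24"],
     "endpoint": "198.51.100.2:51820"},
    {"public_key": "PLACEHOLDER_KEY_A", "allowed_ips": ["10.0.1.0/24"]},
]
config_text = render_peers(peers)
print(config_text)
```

Sorting the peers keeps the rendered text deterministic, which makes it cheap to detect whether anything changed before reloading the interface.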

For our purposes, it basically doesn't add any noticeable latency. I think
when we tested we saw something on the order of 0.1ms of added latency over
wireguard, but I don't quite remember. It's never been the source of latency
problems when we do have them, at least!

~~~
Karupan
Thank you. Can’t wait for a more detailed write up.

P.S: I’ve built a service using Fly and can’t recommend it enough!

------
hashamali
Question about fly.io: do you support HTTP/2? I have wanted to put gRPC
services directly on the edge but most managed services make it completely
convoluted to set up (the lone exception being Google Cloud Run).

~~~
tptacek
We do! Here's a walkthrough of an H2-based DOH server running on Fly; H2
should just work.

[https://fly.io/docs/app-guides/run-a-private-dns-over-
https-...](https://fly.io/docs/app-guides/run-a-private-dns-over-https-
service/)

(Beyond that, for whatever it's worth: you can skip our HTTP/H2 termination
entirely and speak TCP directly to your VMs).

------
mholt
Anyone looking to automate certificate management at any sort of scale should
read this: [https://docs.https.dev/acme-ops](https://docs.https.dev/acme-ops)

... and use Caddy to do the heavy lifting. (I'm biased, yes. But the linked
doc is multi-authored and applies to every sysadmin or developer who needs to
manage certs, regardless of your software choice.)

~~~
mrkurt
That is a fabulous article.

I also love Caddy. In fact, you can run it on Fly.io (and even opt out of our
TLS/cert stack). I would love it if it could just put certs in Vault, though.

------
lomkju
Can you tell me why I should choose Fly instead of AWS?

Fly: micro-2x, shared, 512MB, $0.000003044, $8
vs.
AWS: t3a.nano, 2, Variable, 0.5 GiB, EBS Only, $0.0031 per Hour

Am I missing something? Because looking at the pricing, I still feel AWS is
cheaper.
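
The two quoted prices use different units, so a quick conversion helps. The sketch below assumes the Fly figure of $0.000003044 is a per-second rate and that a month is 730 hours; both are assumptions, although the result lines up with the quoted $8.

```python
SECONDS_PER_HOUR = 3600
HOURS_PER_MONTH = 730  # assumption: average month length used for billing

fly_per_second = 0.000003044  # assumed to be a per-second rate
aws_per_hour = 0.0031         # quoted t3a.nano hourly rate

fly_per_hour = fly_per_second * SECONDS_PER_HOUR
fly_per_month = fly_per_hour * HOURS_PER_MONTH
aws_per_month = aws_per_hour * HOURS_PER_MONTH

print(f"Fly micro-2x: ${fly_per_hour:.4f}/hour, ${fly_per_month:.2f}/month")
print(f"AWS t3a.nano: ${aws_per_hour:.4f}/hour, ${aws_per_month:.2f}/month")
```

Under those assumptions the instance-only comparison does favour the t3a.nano, before counting EBS storage, bandwidth, or running in more than one region.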

~~~
mrkurt
It's probably better to compare Fly with Lambda or Fargate. It's not really
meant to be cheaper than AWS, though, the real value is being able to run app
servers all over the world without spending time maintaining servers or
wrangling AWS.

~~~
lomkju
Makes sense. Compared with AWS Lambda pricing, fly.io is way cheaper. Will
give it a try :)

------
awinter-py
woo, hadn't heard about firecracker

~~~
tptacek
Firecracker is f'ing awesome. I have a lot of notes to write up about it. I
know this isn't how products actually succeed in the real world, but I'll be
honest and say that Kurt had me at Fly with "WireGuard and Firecracker".

(For the unfamiliar reader: Firecracker is a microVM system that sits sort of
in between a fully virtualized host, like an EC2 instance, and a container
like Docker; you get the security isolation of a hypervisor but the
speed/simplicity of Docker. It's the engine that powers AWS Lambda and
Fargate. The Usenix paper is a pretty great read, and the code [it's all in
Rust] is simple and easy to follow.)

[https://www.usenix.org/system/files/nsdi20-paper-
agache.pdf](https://www.usenix.org/system/files/nsdi20-paper-agache.pdf)

~~~
AlphaSite
It’s fairly similar in concept to:
[https://vmware.github.io/vic/](https://vmware.github.io/vic/) for vsphere

Disclaimer: interned with the team

~~~
tptacek
Say more, if you can! I'm not at all familiar with that project. Thanks!

