The database servers powering Let's Encrypt (letsencrypt.org)
529 points by jaas on Jan 21, 2021 | 226 comments



I was, long ago, an old-school Unix sysadmin. While I was technically aware of how powerful smallish servers have become, this article really crystallized that for me.

64 cores and 24 NVME drives in a 2U spot on a rack is just insane compared to what we used to have to do to get a beefy database server. And it's not some exotic thing, just a popular mainstream Dell SKU.

If you price it out on Dell's site, you get a retail price north of $200k. That's really what made it clear for me: that you could fit $200k+ worth of DIMMs, drives, and CPUs into a 2U spot :)


Just so everyone here is aware re: pricing...

If you buy through a VAR and/or Dell reps you don't pay the price on the website. What you actually pay is typically significantly lower. I don't think anyone actually buys servers like these by just ordering from the website. We (Let's Encrypt) certainly don't.

These are expensive servers, crossing into six digits, but not $200k.


Literally the only reason to display a retail price on the website is so sales reps can say "see we are giving you a massive XX% discount!"


I wonder if such “price on application” behaviour is partially what drives services like AWS.

I’m looking at buying some FortiGate firewalls to do some NATting; it looks like 200Fs will be fine, but even the Fortigate sales guys won’t give me a price. I have to go to resellers, who also refuse to give prices, which adds friction. Cisco is exactly the same.

When looking at options, price is at the forefront of my mind, but sales guys want me to choose their company, and even commit to the specific device and numbers, before I even see the price.

Almost walked away from the FortiGate option until I found avfirewalls, which gave me a ballpark idea of what I could afford to implement and what the trade-offs were. The benefits of the FortiGate over a MikroTik were worth it at that price, but it was painful getting the price out of them, and they nearly lost the sale as I assumed it would be 10 times more.


You can back into the price from used gear.

eBay has them for $3.2k and CDW has them new for $3.6k.


The same bullshit as US medical prices, it seems:

- Hospitals have an interest in massively inflated sticker prices to make as much money as possible from those who pay in cash, e.g. foreigners, and to make the debts look better for collection agencies. Basically, assume the true cost of a procedure is $1,000: if the hospital bills $1,000 and that goes to a collections agency that buys the debt for 10% of its value, the hospital gets $100; if the hospital bills $10,000 and the agency pays 10%, the hospital gets $1,000, i.e. the actual cost.

- Insurers have an interest in high sticker prices because they will negotiate with the hospital to pay true cost plus some markup anyway; with a higher sticker price, they can claim that their insurance saves buyers a higher percentage

- Employers have an interest in high sticker prices because they can market themselves as employers who provide better health insurance than the competition

The people losing out in this gamble are those who cannot afford insurance and have to declare medical bankruptcy.

And in the case of hardware or even some "contact sales for a quote" SaaS it's the same end result: the ones who lose out are small businesses (who can't achieve the sales amount to qualify for cheap-ish rates), and the big companies with dedicated account managers have a nice life.


But don't they lose sales from people who never bother to call a sales rep because they think it's out of their price range?


Sounds about right.


This is Anchoring, a cognitive bias where an individual depends too heavily on an initial piece of information offered to make subsequent judgments during decision making.


There's a good thread on the sysadmin reddit where folks share the actual prices they're paying


Link for the curious?



Yes, that's worth noting. The $200k is the list/retail, which nobody would pay for a purchase of this size.


At this size/scale I wonder what the actual price would be for a comparable Supermicro system. If I had to make a wild guess, well under half the 200k previously quoted.


The NVME drives are what's making it so expensive. Dell retail is $6044 for each of the Intel P4610 6.4TB drives. That's $145k.


After 15 seconds of googling, the P4610 U.2-format 6.4TB seems to have a single-unit street price of more like $2,400 from non-Dell vendors. I'm mildly surprised it's that low, considering that U.2-format stuff, for serious servers, will always command a premium price.

Probably in the range of $2100 to $2200 per unit from a x86-64 component distributor in moderate quantities.


Just noticed this in the Dell cart...

"42% off list price: use code SERVER42"

Doesn't make the price reasonable exactly, but it's kind of funny.


You should try server43


server100 for the win


server1000 and we can retire early


it seems like server-1000 is actually what one might be looking for


It looks like performance-wise I can compare it to a couple of 970 EVOs stuck together, and those are $160/TB retail. So from that point of view you're paying more than double.


Well, now I have to ask an old boss if he called a rep. He didn't spend quite $100k+, but we did get a single server from Dell for about twenty grand half a decade back or more, and now I'm curious. I don't think NVMe drives were mainstream at the time, or anywhere near affordable. I suspect he did buy it directly off the site, as he did with the other hardware we got (mainly workstations).

In a way you could say the ones not calling a rep are subsidizing your cost for the server.


This goes for most things. We buy Dell laptops, and the price drops like 20-40% easy.


One of the interesting things about Threadripper and EPYC systems with a lot of PCIe 4.0 lanes on one motherboard is that, if your goal is not NVMe disk I/O but rather network throughput, they also provide the opportunity for a DIY approach to very high capacity multi-100GbE routers.

Such as a fairly low cost 4U threadripper box, running VyOS (ultimately based on a very normal debian Linux) with multiple Intel 100GbE NICs in PCI-E slots.

The Intel linux kernel (and FreeBSD) driver support for those is generally excellent and robust. Intel has had full time developers doing the Linux drivers for that series of cards since the earliest days of their 64-bit/66MHz PCI-X 10Gbps NICs about 17 years ago.

Also worth mentioning that FRR is now an official Linux Foundation supported project.

https://frrouting.org/

https://www.intel.com/content/www/us/en/products/docs/networ...
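
For a sense of what that looks like day to day, VyOS configuration is a Juniper-style CLI sitting on top of that Debian base. Here's a minimal sketch (interface names, addresses, and the next-hop are placeholders, not a real deployment):

  configure
  # placeholder addressing on two of the 100GbE interfaces
  set interfaces ethernet eth0 address '203.0.113.1/24'
  set interfaces ethernet eth1 address '198.51.100.1/24'
  # a static default route; dynamic routing (BGP/OSPF via FRR) is configured
  # the same way under "set protocols ..."
  set protocols static route 0.0.0.0/0 next-hop 203.0.113.254
  commit
  save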


Altavista used to run their entire search engine on a single computer twenty years ago. Imagine what Mike Burrows could accomplish today with one of these 2U EPYC NVME 200gbps PCs. It'd probably trigger the singularity.


For a comparison, you could get a 64-processor, 256GB RAM Sun Starfire around 20 years ago [1]. Wikipedia claims these cost well over a million dollars ($1.5 million in 2021 dollars). This machine was enormous (bigger than a rack), would have had more non-uniform memory access to deal with, and the processors were clocked at something like 250-650MHz.

[1] https://en.wikipedia.org/wiki/Sun_Enterprise


That's a pretty good comparison. It weighed 2000lbs, and was 38 inches wide, or basically 2 full racks, which I guess you could call an 84U server. It was also 49 inches deep, versus a standard rack which is 36 inches deep.


iirc, anton got a kernel compile down to about a minute with -j32 back in the day while at ozlabs?


I have a motherboard from 2012 and I just put 2x 8TB NVMe SSDs on it, on a PCIe 2.0 x16 slot

Works great. The PCIe card itself has 2 more slots for SSDs

The GPU is on the 2.0 x8 slot because they don't really transfer that much data over the lanes.

I honestly didn't realize PCIe was up to 4.0 now. I am pushing up against the limits of PCIe 2.0, but it still works! And it's only a limit when I want faster than 3,000 megabytes per second, which is amazing.

Granted, this would have been considered a good enthusiast motherboard in 2012. Buying new but cheap is the mistake.


What drives did you get? I think you need PCI 4 to stress most SSDs these days?


I have a hunch that the pcie card itself is most important as it is doing bifurcation.

So each drive acts like it has its own slower (but fast enough) pcie slot, and then the raid0 combines the bits back to double the performance.

Could be wrong but I get 2,900 megabytes per second transfers from RAM to disk and back.

And this is PCIe 2.0 x16

So maybe if you want 3,000, 4,000, or 6,500 megabytes per second, then I have nothing to brag about. I’m pretty amazed though and will be content for all my use cases.


STH recently did a video about this dense mass of storage & compute they call a server. iirc they specced it with their discounts to end up at a much more reasonable ~$90k.

https://youtu.be/9PLvt3WATMU


And if you don't want to buy Dell, you can get a SuperMicro server that has about 2x the specs for half the price ($102k).

https://www.siliconmechanics.com/system/a+-ultra-superserver...

2U

128-Core

307TB NVME

4 TB RAM

-------

= $102k


We profiled Supermicro vs. Dell vs. HP vs. IBM servers a decade ago for trading applications.

Supermicro was cheaper by 20-30%, had a quantifiably better power profile (which basically meant it paid for itself), and just rocked on every dimension. And their sales reps were able to answer every question we had, including sending us an Excel model of their power usage. The Dell folks never got back to us on power usage.

I dunno why folks insist on Dell crapola.


I'm surprised that's what did it for me. I admined some Sun Enterprise servers and was blown away when I bought a Pi 3 years ago and realized the Pi was probably faster than it. That enterprise server would have 100-200 engineering students logged in and working at once. :-/ We're living in the future.


I like to play a game called "A current Raspberry Pi (or mobile phone) is equivalent to what year's global computing power?"

Also "fun" is "The solid state storage I can buy for $100 is equivalent to all the world's computing storage in what year?"

I'm super fun at parties!


> the pi was probably faster than it.

If we're talking about 90s or early 2000s Sun, probably. Even then, though, these systems probably had substantially better I/O performance.

I have a SPARC laptop lying around from 1995 that gets a whopping 20 MB/s in disk read/write speeds across its dual half height SCSI drives. That still beats all but the best SD cards.

The full size systems with SAS and other options would get even better disk performance.


Totally. 2 TB of RAM! In one box!

I think the first servers I had in production had 8 MB RAM. No more, certainly. Soon we'll be at 1000x that. My dad's first "server" was 3 orders of magnitude smaller, with 8 KB of RAM (hand-wound wire core memory). In that time, the US population hasn't even doubled.


https://yourdatafitsinram.net/

The listed servers are from a few years ago, so I guess there might be some bigger ones available these days.


More like 1,000,000x


Oops! Yes, that's correct.


The phone in my pocket has 1,000 times 8MB of RAM. A million of your first servers in one box. In about 30 years?


We've made so much progress in that field over the last 40 years... I sometimes get lost trying to imagine what computing performance will be available 100, 1,000, or 10,000 years from now...


And I'm afraid it will still obey this trend.

https://danluu.com/input-lag/


That doesn't seem like some fundamental restriction, but rather a compromise we're always willing to make. 30ms is not noticeable, so we don't try to lower it and sacrifice something else.

Even the super slow ones, like on a Kindle, are _choices_ that have been made in favor of something else. A second to turn a page of a book isn't unbearable.


30ms latency is acceptable. But that's the best we ever had.

High-end Android phones have 120+ ms latency. That's easily noticeable and actually annoying (at least to me).

My personal pet peeve is the latency of input when i'm starting new applications in KDE. E.g. I'll start a new terminal with Super+Enter, followed by a Super+Right Arrow in order to tile it to the right. But the latency is big enough, that often it's not the terminal that ends up tiled, but the application that had focus earlier, e.g. a web browser. It's really annoying.

I also don't understand how still in 2020 when I move a window with the mouse, the window can't keep up with the mouse.

30ms input lag is what we should work towards. But that's not what we have today. Today is actually crap.


It’s so frustrating that our phones and computers can’t move at the speed our minds move. I know they could if someone cared to make it a priority.


I think that if you had a top-to-bottom approach to a Linux distro designed entirely for latency of user interaction, you'd get something that would keep up with you.

I keep meaning to try out WindowMaker again as my Fedora window manager. I feel like it would be incredibly speedy on today's hardware.


Sticks and rocks most likely.


Yep, quantum entanglement materials thru and thru


Single threaded performance isn't keeping pace unfortunately.


That just means multithreaded will become more and more important. Languages like rust may gain some usage as we start to require safe multithreading.


That may be so, but that doesn't make it any less depressing. Some things can't be run in parallel. I can't mail a letter until it is written.


Dell-controlled VMWare is listed under "major sponsors and funders". I wonder if Let's Encrypt got a discount on that server. Good for them!


You can build a pretty nice home 32-core/256gb threadripper system for ~ $5k. (24 nvme drives not included)


Dell makes it easier than it should be by having absurd markups on storage.


I'm thankful for their OpenZFS tuning doc which they developed as part of this server migration: https://github.com/letsencrypt/openzfs-nvme-databases

The one thing that I get hung up on when it comes to RAID and SSDs is the wear pattern vs. HDDs. Take for example this quote from the README.md:

We use RAID-1+0, in order to achieve the best possible performance without being vulnerable to a single-drive failure.

Failure on SSDs is predictable and usually expressed with Terabytes Written (TBW). Failure on spinning disk HDDs is comparatively random. In my mind, it makes sense to mirror SSD-based vdevs only for performance reasons and not for data integrity. The reason is that the mirrors are expected to fail after the same amount of TBW, and thus the availability/redundancy guarantee of mirroring is relatively unreliable.

Maybe someone with more experience in this area can change my mind, but if it were up to me, I would have configured the mirror drives as spares, and relied on a local HDD-based zpool for quick backup/restore capability. I imagine that would be a better solution, although it probably wouldn't have fit into tryingq's ideal 2U space.


> Failure on SSDs is predictable and usually expressed with Terabytes Written (TBW). Failure on spinning disk HDDs is comparatively random.

That wasn't my experience with thousands of SSDs and spinning drives. Spinning drives failed more often, but usually with SMART sector counts increasing beforehand. Our SSDs never got close to media wearout, but that didn't stop them from dropping off the bus. Literally working fine, then boom, can't detect; all data gone.

Then there's the incidents where the power on hours value rolls over and kills the firmware. I believe these have happened on disks of all types, but here's a recent one on SSDs [1]. Normally when building a big server, all the disks are installed and powered on at the same time, which risks catastrophic failure in case of a firmware bug like this. If you can, try to get drives from different batches, and stagger the power on times.

[1] https://www.zdnet.com/article/hpe-says-firmware-bug-will-bri...


> Failure on spinning disk HDDs is comparatively random.

comparatively, yes, but when averaged out over a large number of hard drives it definitely tends to follow a typical bathtub curve failure model seen in any mechanical product with moving parts.

https://www.itl.nist.gov/div898/handbook/apr/section1/gifs/b...

early failures will be HDDs that die within a few months of being put into service

in the middle of the curve, there will be a constant steady rate of random failures

towards the end of the lifespan of the hard drives, as they've been spinning and seeking for many years, failures will increase.


SSD’s still fail, just not often.

State of the art systems keep ~1.2 copies (e.g. 10+2 raid 6) on SSD, and an offsite backup or two. The bandwidth required for timely rebuilds is usually the bottleneck.

These systems can be ridiculously dense; a few petabytes easily fits in 10U. With that many NAND packages, drive failures are common.


An early mitigation strategy has been to use different brands of ssd so that failure rates get more spread out. For raid 6 we have started out with max 2 drives of a single brand.

The result for us was 2 drives that failed within the same month, of the same brand, and from there it seems to be single failures rather than clusters.


The tuning is almost identical to what we have in production. A few comments:

* I would probably go with ashift=12 to get better compression ratio, or even as far as ashift=9 if the disks can sustainably maintain the same performance. Benchmark first, of course.

* We came to the same conclusion regarding AIO recently, but just today I did more benchmarking, and it looks like the ZFS shim does perform better than the InnoDB shim. So I think it's still fine to enable innodb_use_native_aio

* We use zstd compression from ZFS 2.0. It's great, and we all deserve it after suffering through the PR dramas.
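
For reference, the knobs mentioned above look roughly like this; the pool/dataset names are placeholders and, as noted, you should benchmark rather than copy values:

  # ashift is fixed per vdev at pool creation time (12 = 4K sectors, 9 = 512B)
  zpool create -o ashift=12 db mirror /dev/nvme0n1 /dev/nvme1n1
  zfs create db/mysql
  # zstd compression is available from OpenZFS 2.0 onwards
  zfs set compression=zstd db/mysql
  # check what ratio you're actually getting
  zfs get compressratio db/mysql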


> The reason is that the mirrors are expected to fail after the same amount of TBW

You could fix that by writing a bit more to one of the disks, e.g. run badblocks for different amounts of time before putting them in service.


Just goes to show how much a single SQL server can scale before having to worry about sharding and horizontal scaling.


Based on their stated 225M sites and a renewal period of 90 days, they're probably averaging around 40 certificates per second. That's only an order of magnitude higher than bitcoin; I wouldn't call it an indication of an ability to scale to a particularly large amount of traffic.


They might average that, but we all know averages only work on paper. For example, AWS has tutorials with instructions on how to set up TLS in a LAMP stack running on Amazon Linux 2 EC2s. As part of the Let's Encrypt setup, they provide a ready-to-paste crontab entry that runs twice a day. How many EC2s all hit the Let's Encrypt servers at that exact time? Since EC2s default to UTC, those requests aren't even offset by timezone, so the spike is even bigger.


OpenBSD recently added the "~" random range separator to their crontab syntax. The manual for acme-client, the native Let's Encrypt tool, provides

  ~  *  *  *  *  acme-client example.com && rcctl reload httpd
as the example crontab entry. Though, for maximum friendliness to Let's Encrypt that should probably be something like "~ ~ * * ~", which would run the command at a random time once per week. I think you could accomplish something similar using systemd timer units and RandomizedDelaySec, assuming it permits a delay as high as 604800.
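
On Linux, where vixie-style cron lacks the "~" syntax, you can approximate the same thing by sleeping a random amount before running the client. A sketch, assuming coreutils shuf and certbot:

  # run twice a day, at a random offset of up to an hour
  0 */12 * * * sleep $(shuf -i 0-3600 -n 1) && certbot renew -q

Systemd timer units with RandomizedDelaySec get you the same effect without the sleep hack.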


You probably should not run it only once a week. That means it takes up to a whole week to even attempt to automatically fix a problem, which is a long time. True that's going to be four attempts before, finally, the certificate actually expires (if the problem is just intermittent service availability), but it doesn't seem worth the small additional risk.

Now, sure, in principle you should have active monitoring so you'd know immediately if there's a problem status e.g. certificate with only 14 days left until expiry; revoked, expired or otherwise bad certificate presented by server; OCSP staples missing. But we know lots of people don't have monitoring, and I guarantee at least one person reading this HN thread is mistaken and isn't monitoring everything they thought they were.

Like Certbot, acme-client does a local check before taking any action, so if run once a day (or indeed once an hour) it will not normally call out to the Let's Encrypt service.

Unlike Certbot, acme-client doesn't do OCSP checks, so it won't even talk to the network to get an OCSP response. Even with Certbot, this (OCSP) is provided through a CDN (so it's roughly similar in cost to fetching a static image for a popular web site), and so your check is negligible in terms of operating costs for Let's Encrypt.


It's worth noting that running certbot / LE twice a day doesn't actually hit the LE servers twice a day. It just checks the certificate dates locally, and if the certs have been renewed in the last month it does nothing.

I guess you still have a peak time of 00:00 UTC every day though unless people are using servers set to their local time.


I believe modern Certbot also does OCSP. The intent here is, if for any reason your certificate was revoked it makes sense to try to obtain a new certificate even if it hasn't expired. Even if it can't, perhaps the operator will notice a logged reason from Certbot when they did not yet notice their certificate was revoked.

Examples of reasons your certificate might have been revoked:

* You re-used a private key from somewhere, perhaps because you don't understand what "private" means, and other copies of that key leaked

* You didn't re-use the private key but your "clever" backup strategy involves putting the private key file in a public directory named /backup/server/ on your web server and somebody found it

* You use the same certificate for dozens of domain names in your SEO business and yesterday you sold a name for $10k. Hooray. The new owner immediately revoked all certificates for that name which they're entitled to do.

* Your tooling is busted and the "random" numbers it picked aren't very random. (e.g. Debian OpenSSL bug)

* A bug at Let's Encrypt means their Ten Blessed Methods implementation was inadequate for some subset of issuances and rather than cross their fingers the team decided to revoke all certificates issued with the inadequate control.

* Let's Encrypt discovers you're actually a country sanctioned by the US State Department, perhaps in some thin disguise such as an "independent" TV station in a country that doesn't have any independent media whatsoever. It is illegal for them to provide you with services and you were supposed to already know that.

So that is a network connection, but not to the Let's Encrypt servers described in this story.

In practice OCSP is done by a big CA by periodically computing OCSP responses for every single unexpired certificate (either saying it's still valid, or not), and then providing those to a CDN and the CDN acts as "OCSP server" returning the appropriate OCSP response when asked without itself having possession of any cryptographic materials.
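
If you're curious what your client actually receives, you can fetch one of those CDN-served responses for your own certificate. A sketch, assuming cert.pem is the leaf and chain.pem is the issuing intermediate; the responder URL is read out of the certificate itself:

  # find the OCSP responder URL embedded in the leaf certificate
  openssl x509 -in cert.pem -noout -ocsp_uri
  # ask the responder about the certificate's status
  # (add -CAfile or -noverify if the response signature check complains)
  openssl ocsp -issuer chain.pem -cert cert.pem \
      -url "$(openssl x509 -in cert.pem -noout -ocsp_uri)"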


Certbot has a random sleep before it does anything when run non-interactively.


Yes. They are not doing a very heavy computational workload. Typical heavy-duty servers these days can do 100k's or millions of TPS. 40 TPS is a really, really, really light load.

Further, I was looking at those new server specs. There's an error, I think? The server config on the Dell site shows 2x 8 GB DIMMs, for 16 GB RAM per server, whereas the article says 2 TB!

With only 16GB of RAM, but 153.6 TB of NVMe storage, the real issue here is memory limitation for a general-purpose SQL database or a typical high-availability NoSQL database.

Check my math: 153600 GB storage / 16 GB memory = 9600:1 ratio

Consider, by comparison that a high data volume AWS i3en.24xlarge has 60TB of NVMe storage but 768 GB of RAM. A 78:1 ratio.

If the article is correct, and the error is in the config on the Dell page (not the blog), and this server is actually 2 TB RAM, then that's another story. That'd make it a ratio of 153600 / 2000 = ~77:1.

Quite in line with the AWS I3en.

But then it would baffle me why you would only get 40 TPS out of such a beast.

Check my logic. Did I miss something?


Why are you assuming that their workload includes just one query per emitted certificate?

The reality is that they are storing information during challenges, implementing rate limiting per-account, supporting OCSP validation and a few other things.

You can investigate further if you really want to see the queries that they make against the database since their software (Boulder) is open source [1]. Most queries are in the files in the "sa" (storage authority) folder.

[1] https://github.com/letsencrypt/boulder/


Kudos for sharing the location of relevant code. This is the kind of referencing we need


They don't only get 40 TPS; that value is an estimate of what they serve (above it was sites/90 days).

They have capacity for much, much more with that hardware


Yes, you can customize the config on Dell's site to upgrade the memory.


Don’t these certificates have long RSA keys which are more expensive computationally? Though I guess that doesn’t have to happen on the database server.


The only RSA computations Let's Encrypt need to do are:

* Signature by their RSA Intermediate (currently R3, with R4 on hot standby) - which will be a dedicated piece of hardware - to issue a subscriber's certificate. In practice this happens twice, as a poisoned pre-certificate to obtain proof of logging from public log servers, and then the real certificate with the proofs baked inside it.

* Signatures by their OCSP signer periodically on an OCSP response for each certificate saying it's still trustworthy for a fixed period. Again this will be inside an HSM.

* Signature verification on a subscriber's CSR. To "complete the circuit" it's helpful that Let's Encrypt actually confirms you know the private key corresponding to the public key you wanted a certificate for, the signature on your CSR does this. Some people don't think this is necessary, but I believe Let's Encrypt do it anyway.

You're correct that none of this happens on the database servers. I guess it's possible their servers use TLS to secure the MariaDB connections, in which case a small amount of either ECDSA or RSA computation happens each time such a connection is set up or torn down like at any outfit using TLS, but those database connections are cached in a sane system so that wouldn't be very often.
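
That CSR self-signature check is cheap to reproduce locally if you want to see what it amounts to (example.csr here is just a placeholder PEM-encoded request):

  # verify the self-signature on a certificate signing request
  openssl req -in example.csr -noout -verify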


Does "certbot renew" talk to the mothership at all if no certs are ready for renewal? If it does, most setups I've seen run the renewal once or twice a day since it only does the renew when you're down to 30 days left. There may also be some OCSP related traffic.


Certbot will look at the expiration timestamp on your local certs without talking to Lets Encrypt.


There is a --force option though. I had to use it once because they thought I had a bad cert (don't remember the details).


Bitcoin is ECDSA verification, letsencrypt is generating RSA signatures, the two aren't even remotely comparable.


But the cryptographic bottleneck is in HSMs, not in database servers (database servers don't generate the digital signatures, they just have to store them after they've been generated).


Bitcoin is actually not limited by compute power at all. It's an artificial cap on transaction rate, to prevent the blockchain from expanding so large that normal users can't host the whole thing.

You can see the blockchain size was growing exponentially but then switched to linear as we hit the transaction cap; it now sits at about 350GB.


Read performance is much easier to scale (in one box or several) than write performance. It's usually the writes that make you look at Cassandra and similar, instead of adding more disks and RAM, or spinning another read-only replica.

24 NVMEs should have a lot of write throughput, though.


This is especially true with your own hardware. Trying this kind of thing in the cloud is usually prohibitively expensive.


It doesn’t have to be. Unless you’re conditioned to believe that AWS is cheap.


I'm curious what the cost would be to run this type of hardware at any cloud vendor? Does it even exist?


That's the wrong way to think about the cloud. A better way to think about it would be "how much database traffic (and storage) can I serve from Cloud Whatever for $xxx". Then you need to think about what your realistic effective utilization would be. This server has 153600 GB of raw storage. That kind of storage would cost you $46000 (retail) in Cloud Spanner every month, but I doubt that's the right comparison. The right math would probably be that they have 250 million customers and perhaps 1KB of real information per customer. Now the question becomes why you would ever buy 24×6400 GB of flash memory to store this scale of data.


Not completely comparable, but Hetzner offers the following dedicated server that costs 637 euro/month (maxed out):

- 32-core AMD EPYC

- 512GB ECC memory

- 8x 3.84TB NVMe datacenter drives

- Unmetered 1gbps bandwidth


(I work at AWS, but this is just for fun)

Checking out the AWS side, the closest I think you'd get is the x1.32xlarge, which would translate to 128 vCPU (which on Intel generally means 64 physical cores) and close to 2TB of RAM. NVMe storage is only a paltry 4TB, so you'd have to make up the rest with EBS volumes. You'd also get a lower clock speed than they are getting out of the EPYCs.


spittakes reading the suggestion of replacing NVMe with EBS

I mean, yeah, I guess you can. But a lot depends on your use case and SLA. If you need to keep ultra-low p99s — single digits — then EBS is not a real option.

But if you don't mind latencies, then yeah, fine.

Don't get me wrong: EBS is great. But it's not a panacea and strikes me as a mismatch for a high performance monster system. If you need NVMe, you need NVMe.


It wasn't a suggestion of replacement, it was saying that even the closest instance still doesn't get you 24TB of on-host storage, having to use EBS instead.


This starts going down a rabbit hole of tiered storage then. Because unless you design for that, you can only be as fast as your slowest storage. It'd be like tying a 20 kg weight to the leg of a sprinter otherwise.


If you eschew RDS, the largest you can go up to seems to be a u-24tb1.metal.

448 vcpu, 24TiB of RAM, $70 an hour. ~$52k per month.


Oh, only $70 per hour. I think I can start that for an hour or so before my wallet is exhausted :P


I spun up a 32 vCore spot instance to crack my own WiFi password once. It only cost ~$2 and I felt so powerful.


Stopping by to say, 9ms API response time is just ridiculously quick. You're starting to run into the laws of physics and client proximity to the datacenter where those machines live. That's a pretty amazing feat. I would assume the next step for scaling is getting those read replicas deployed across the world in order to cut down on RTT.


9ms 50%ile node latency is good, but that number is normally 1ms for big internet services. See https://dl.acm.org/cms/attachment/3658918e-7081-4676-beec-aa... Mission critical stuff goes even faster like BigTable which has numbers 4x better than the figure.


Why would they need to bother when it's mostly machines talking to machines? Certbot doesn't care that it took 90ms vs 9ms.


The 9ms is one indicator that the new hardware platform has ample extra capacity for future growth in load and traffic, it probably won't need to be replaced or upgraded for some years.


> We can clearly see how our old CPUs were reaching their limit. In the week before we upgraded our primary database server, its CPU usage (from /proc/stat) averaged over 90%

This strikes me as odd. In my experience, traditional OLTP row stores are I/O bound due to contention (locking and latching). Does anyone have an explanation for this?

> Once you have a server full of NVMe drives, you have to decide how to manage them. Our previous generation of database servers used hardware RAID in a RAID-10 configuration, but there is no effective hardware RAID for NVMe, so we needed another solution... we got several recommendations for OpenZFS and decided to give it a shot.

Again, traditional OLTP row stores have included a mechanism for recovering from media failure: place the WAL log on a separate device from the DB. Early MySQL used a proprietary backup add-on as a revenue model, so maybe this technique is now obfuscated and/or missing. You may still need/want a mechanism to federate the DB devices, and incremental volume snapshots are far superior to full DB backups, but placing the WAL log on a separate device is a fantastic technique for both performance and availability.

The Let's Encrypt post does not describe how they implement off-machine and off-site backup-and-recovery. I'd like to know if and how they do this.


> The Let's Encrypt post does not describe how they implement off-machine and off-site backup-and-recovery. I'd like to know if and how they do this.

The section:

> There wasn’t a lot of information out there about how best to set up and optimize OpenZFS for a pool of NVMe drives and a database workload, so we want to share what we learned. You can find detailed information about our setup in this GitHub repository.

points to: https://github.com/letsencrypt/openzfs-nvme-databases

Which states:

> Our primary database server rapidly replicates to two others, including two locations, and is backed up daily. The most business- and compliance-critical data is also logged separately, outside of our database stack. As long as we can maintain durability for long enough to evacuate the primary (write) role to a healthier database server, that is enough.

Which sounds like traditional master/slave setup, with fail over?


> Which sounds like traditional master/slave setup, with fail over?

Yes, thank you. I assumed that the emphasis on the speed of NVMe drives meant that master/slave synchronous replication was avoided and asynchronous replication could not keep up. In my mind, this leaves room for interesting future efficiency/performance gains, especially surrounding the "...and is backed up daily" approach mentioned in your quote.

The bottom line is that the old RPO (Recovery Point Objective) and RTO (Recovery Time Objective) are as important as ever.


> This strikes me as odd. In my experience, traditional OLTP row stores are I/O bound due to contention (locking and latching). Does anyone have an explanation for this?

Yes. My CTO, Avi Kivity did a great talk about this at Core C++ 2019: https://www.scylladb.com/2020/03/26/avi-kivity-at-core-c-201...

Let me boil it down to a few points; some beyond Avi's talk:

• Traditional RDBMS with strong consistency and ACID guarantees are always going to exhibit delays. That's what you want them for. Slow, but solid.

• Even many NoSQL databases written (supposedly) for High Availability still use highly synchronous mechanisms internally.

• You need to think about a multi-processor, multi-core server as its own network internally. You need to consider rewriting everything with the fundamental consideration of async processing, even within the same node. Scylla uses C++ futures/promises, shared-nothing shard-per-core architecture, as well as new async methods like io_uring.

• Between nodes, you also have to consider highly async mechanisms. For example, the tunable eventual consistency model you'd find in Cassandra or Scylla. While we also support Paxos for LWT, if you need strong linearizability, read-before-write conditional updates, that comes at a cost. Many classes of transactions will treat that as overkill.

• And yes, backups are also a huge issue for those sorts of data volumes. Scylla, for example, has implemented different priority classes for certain types of activities. It handles all the scheduling between OLTP transactions as highest priority, while allowing the system to plug away at, say, backups or repairs.

More on where we're going with all this is written in a blog about our new Project Circe:

https://www.scylladb.com/2021/01/12/making-scylla-a-monstrou...

But the main point is that you have to really think about how to re-architect your software to take advantage of huge multi-processor machines. If you invest in all this hardware, but your software is limiting your utility of it, you're not getting the full bang you spent your buck on.


> But the main point is that you have to really think about how to re-architect your software to take advantage of huge multi-processor machines.

I appreciate the response but it doesn't address my question: given that Let's Encrypt's MySQL-family RDBMS does not implement any of the multi-core/multi-socket/cpu-affinity/lock-free/asyncIO techniques used by databases like ScyllaDB, MemSQL, and VoltDB, why were they seeing 90% CPU utilization on their old Intel servers while the upgraded AMD servers were 25% (the expected range)?

I think mike_d's suggestion is most plausible: they probably included custom functions/procedures that invoke CPU-expensive code. I also thought this was a single-node scale-up architecture but since they are using a three-or-more node master/slave architecture, network I/O could somehow be involved.


On "cloud" servers we usually see that when disk I/O stops being the bottleneck (given enough disk I/O you can account for small amounts of RAM). So I'd guess their disk setup had some capacity left when they upgraded or they just have a lot of reads and the data fits in memory


I kind of did answer a different question, but you sort of affirmed my answer and proved my point. If the "expected range" of your CPU utilization is only 25%, then you are knowingly paying for 3x more CPU than you are actually ever planning to use. I suppose if you've got that cash and are willing to burn it, nothing I can say will stop you. I just question the logic of it.

As for why they suddenly dropped? I'll leave that to someone who knows this particular system far better than I do.


I think Jeffbee resolved my confusion; InnoDB uses spin locks that chew up cpu cycles when the request rate exceeds the I/O rate. This is not as bad as it sounds; it uses extra power and generates extra heat but it is not doing any extra real work. InnoDB cores run at either 25% or 90% but the 90% indicates that you need an I/O upgrade not more CPU cycles.

Optimizing for CPU efficiency in a system that is I/O bound will not save significant money. Memory and NVMe are the critical factors and with the master/slave replication used, network I/O significantly undercuts the peak performance this single server is capable of.


> traditional OLTP row stores are I/O bound due to contention (locking and latching). Does anyone have an explanation for this?

I have seen CPU bound database servers when developers push application logic in to the database. Everything from using server-side functions like MD5() to needless triggers and stored procedures that could have been done application side.


Any MySQL with more than about 100 concurrent queries of the same InnoDB table is going to be CPU bound on locks. Their whole locking scheme doesn't scale; it's designed to look great in benchmarks with few clients.


It sounds like you just hit the threshold where defaults don't cut it anymore. With >100 concurrent clients you need to tune your DB for your workload.

innodb_thread_concurrency and innodb_concurrency_tickets would be a good starting point, and optimal values depend on your r/w balance and number of rows touched per type of query.
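
Both of those are dynamic variables, so you can experiment without a restart. A sketch with an illustrative (not recommended) value, plus a way to watch spin-wait pressure while you benchmark:

  # limit how many threads may be inside InnoDB at once (0 = unlimited, the default)
  mysql -e "SET GLOBAL innodb_thread_concurrency = 32"
  # inspect the ticket setting before changing it
  mysql -e "SHOW GLOBAL VARIABLES LIKE 'innodb_concurrency_tickets'"
  # keep an eye on mutex/rw-lock spin waits under load
  mysql -e "SHOW ENGINE INNODB STATUS\G" | grep -A 8 SEMAPHORES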


I'm saying the InnoDB buffer pool mutex doesn't scale over 100 contenders, and you are saying that I can tune MySQL so there's never more than that, so it seems to me we're in agreement.


I did a quick search and it looks like InnoDB implements a spin-lock. Do you see increased CPU utilization when the buffer pool is overloaded? This could explain the behavior described in the article.


Correct, it uses an aggressive spinlock without fairness features for waiters like you'd see with a proper production-quality mutex. This makes it very efficient in the absence of contention (i.e. in trivial benchmarks) and extremely poorly behaved in times of contention.


I assumed they would spin for only a short period of time and then fall back to a real lock.

This is usually done to avoid blocking the thread before the end of its time slice if there is a chance the lock will become available.

If instead they implement a pure spin loop with a sleep, I can see why it does not perform well.


Would it scale better if we set innodb_sync_spin_loops=0?

That would cause the lock to not spin and to give the pending CPU cycles back to the OS if the lock is not available.


I am very interested. Could you provide more detail or a link about this? I would love to understand what they are doing that looks good in benchmarks but does not actually scale.


Indeed, at $PLACE_OF_WORK over the last 15 years a lot of logic was built in PL/SQL, and DB CPU has become one of our most precious resources. For some applications it's perfectly reasonable, until you need to scale horizontally.


MySQL backup is more or less a “solved” issue with xtrabackup, and I assume that’s exactly what they are using with a database of this size.


I don't understand why they are trying so hard to avoid sharding. It seems to me that this is a perfect example of an "embarrassingly parallel" problem for which sharding would be borderline trivial. What am I missing?


Well, their current solution is still a lot simpler, I suppose. With multiple servers there's a lot of extra administrative stuff you need to deal with.


It's also an example of a "mustn't fail or you break the internet" problem and a "lots of people with nation state resources have a reason to fuck with us" problem. They are prioritising simplicity as a means to security, and that makes sense to me.


Well sharding would mean that any outages affect only a small percentage of their users.


It still works like this and they have plenty of headroom with the new solution. Chances are good that when they need to upgrade this solution, technology will have advanced far enough as well. Guessing what their performance demands are (far, far more reads than writes) this seems to work fine for them, so why make it more complicated? They do talk about read-only replicas, so they have some distribution.


Rack space ain’t free


The cost of 4U vs 2U is trivial compared to the cost of this hardware.


Sharding is like amputating a limb! It works really well if the limb is infected with, say, flesh-eating bacteria! But you keep sharding as a last resort! It's hard going back.


I'm more interested in how they used ZFS to provide redundancy. I always thought ZFS was optimized for spinning platters, with SSDs used for persistent caching. In this scenario they used it to set up all their SSDs in mirrored pairs and then striped across those. No separate ZIL device.

They've tweaked a few other settings as well [1].

I'd be curious to see more benchmarks and latency data (especially as they're utilizing compression, and of course checksums are computed over all data not just metadata like some other filesystems).

[1] https://github.com/letsencrypt/openzfs-nvme-databases
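
For anyone who hasn't built one, a striped pool of mirrors is just a zpool made of multiple mirror vdevs; ZFS stripes writes across all of them. A minimal sketch with four placeholder device names (the real pool is much larger and tuned per the linked repo):

  # each "mirror a b" is a 2-way mirror vdev; writes are striped across vdevs
  zpool create db \
    mirror /dev/nvme0n1 /dev/nvme1n1 \
    mirror /dev/nvme2n1 /dev/nvme3n1
  zpool status db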


ZFS was started around 2001, when SSDs weren’t really a thing. Its goals were, amongst other things, to manage multiple volumes (providing redundancy), to be reliable (most filesystems aim for this), and to support cheap snapshotting. The last two were supposed to come from being copy-on-write, and this model was an advantage when SSDs became popular as it worked a bit better with their semantics.


ZFS is optimized for storing data; you can put the ZIL on an SSD, but you can also put it on something faster.


What a great read. I think the authors here made great hardware and software decisions. OpenZFS is the way to go, and is so much easier to manage than the legacy RAID controllers imho.

Ah, I miss actual hardware.


I enjoyed it as well, and I'm appreciative that they shared their configuration notes here. I've been running multiple data stores on ZFS for years now, and it's taken a while to get out of the hardware mindset (although you still need a nice beefy controller anyway).

https://github.com/letsencrypt/openzfs-nvme-databases


Can you explain the advantages of OpenZFS over other filesystems? I know FreeBSD uses ZFS, but I never really understood how it stacks up relative to other technologies...


As someone unfamiliar with db management, is it really less operational overhead to have to physically scale your hardware than using a distributed option with more elastic scalability capabilities?


Relational databases enable some very flexible data access patterns. Once you shard, you lose a lot of that flexibility. If you move away from a relational model, you lose even more flexibility and start having to do much more work in your application layer, and usually start having to use more resources and developer time every step of the way.

The productivity enabled by having one master RDBMS is a big deal, and if they can buy commodity servers that satisfy their requirement, this seems like a fine way to operate.


I agree this is an under appreciated strategy. Someone in my family worked for a hedge fund where one of their simple advantages was they just ran MS SQL on the biggest physical machine available at any given moment. Lots of complexity dodged by just having a lot of brute capacity.


If I had a billion dollars, I'd put a research group together to study the prospects of index sharding.

That is, full table replication, but individual servers maintaining differing sets of indexes. OLAP and single request transactions could be routed to specialized replicas based on query planning, sending requests to machines that have appropriate indexes, and preferably ones where those indexes are hot.


This has been done in popular commercial databases for decades, and is thoroughly researched. As far as I know, these types of architectures are no longer used at this point due to their relatively poor scalability and write performance. I don't think anyone is designing new databases this way anymore, since it only ever made sense in the context of spinning disk.

The trend has been away from complex specialization of data structures, secondary indexing, etc and toward more general and expressive internal structures (but with more difficult theory and implementation) that can efficiently handle a wider range of data models and workloads. Designers started moving on from btrees and hash tables quite a while ago, mostly for the write performance.

Write performance is critical even for read-only analytical systems due to the size of modern data models. The initial data loading can literally take several months with many popular systems, even for data models that are not particularly large. Loading and indexing 100k records per second is a problem if you have 10T records.


We pick up algorithms from 30 years ago all the time. Nobody’s as bad as the fashion industry, but we sure do try.

Part of it is short memories, but part of it is how the cost inequalities in our hardware shifts back and forth as memory or storage or network speeds fall behind or sprint ahead.


The problem is the network. You need billion dollars to fix the network so it's as fast as local ram/nvme.


... and even if you somehow solved that, the laws of physics hit you hard. Latency can be a real performance killer generally, and that is doubly true in database-type computing.


Are you guys talking about single server, non redundant databases? That’s not even apples and oranges. More like watermelons and blueberries.


Why would that be the case? In this case we have already accepted that multiple servers will be involved. That means the limitations of the networking are a given.


Also worth noting that scalability != efficiency. With enough NVMe drives, a single server can do millions of IOPS and scan data at over 100 GB/s. A single PCIe 4.0 x4 SSD on my machine can do large I/Os at 6.8 GB/s rate, so 16 of them (with 4 x quad SSD adapter cards) in a 2-socket EPYC machine can do over 100 GB/s.

You may need clusters, duplicated systems, replication, etc for resiliency reasons of course, but a single modern machine with lots of memory channels per CPU and PCIe 4.0 can achieve ridiculous throughput...

edit: Here's an example of doing 11M IOPS with 10x Samsung Pro 980 PCIe 4.0 SSDs (it's from an upcoming blog entry):

https://twitter.com/TanelPoder/status/1352329243070504964
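
If you want to try to reproduce numbers like that on your own hardware, fio is the usual tool. A rough random-read sketch (the device path, queue depth, and job count are placeholders to tune, and the io_uring engine needs a reasonably recent kernel and fio build):

  # 4K random reads with io_uring, bypassing the page cache
  fio --name=randread --filename=/dev/nvme0n1 --ioengine=io_uring \
      --direct=1 --rw=randread --bs=4k --iodepth=64 --numjobs=8 \
      --time_based --runtime=60 --group_reporting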


My mind exploded when I realized we can read random I/O from disk at 40GB/s, which is faster than my laptop can read from RAM. https://spdk.io/news/2019/05/06/nvme/ Btw, are you using SPDK or io_uring?


> 16 of them (with 4 x quad SSD adapter cards) in a 2-socket EPYC machine can do over 100 GB/s.

It is more interesting whether the actual CPU can handle such traffic in the context of a DB load: encoding/decoding records, sorting, searching, merging, etc.


Yes, with modern storage, throughput is a CPU problem.

And CPU problem for OLTP databases is largely a memory access latency problem. For columnar analytics & complex calculations it's more about CPU itself.

When doing 1 MB sized I/Os for scanning, my 16c/32t (AMD Ryzen Threadripper Pro WX) CPUs were just about 10% busy. So, with a 64 core single socket ThreadRipper workstation (or 128-core dual socket EPYC server), there should be plenty of horsepower left.


As I mentioned memory access latency - I just posted my old article series about measuring RAM access performance (using different database workloads) to HN and looks like it even made it to the front page (nice):

https://news.ycombinator.com/item?id=25863093


If the problem involves independent traversals, interleaving with coroutines is a practical way to hide latency https://dl.acm.org/doi/10.1145/3329785.3329917 https://www.linkedin.com/pulse/dont-stall-multitask-georgios...


We've made Scylla as async, shared nothing as possible, and we've also started adding C++20 coroutines (to replace futures/promises). We'll be doing more of that in 2021.

https://www.scylladb.com/2021/01/12/making-scylla-a-monstrou...


I plan to test Scylla (and Postgres with TimescaleDB) out soon after done with basic Linux tests :)


Just note that TimescaleDB is olap (columnar compression), scylladb is oltp with async-fsync and Postgresql is disk-fsync-on by default.


It can burst to millions of IOPS, but you get killed on the sustained write workload. Even a high end enterprise NVMe drive will be limited to around 60k IOPs once you exceed its write cache.


Yup indeed it's an issue with NAND SSDs - and it heavily depends on the type (SLC, MLC, TLC, QLC), vendor (controller, memory, "write buffer") sizes etc. I'm doing mostly read tests right now and will move to writes after.

The Samsung PRO 980 I have, are TLC for main storage, but apparently are using some of that TLC storage as a faster write buffer (TurboWrite buffer) - I'm not an expert, but apparently the controller can decide to program the TLC NAND with only 1-bit "depth", they call it "simulated SLC" or something like that. On the 1 TB SSD, the turbowrite buffer can dynamically extend to ~100 GB, if there's unused NAND space on the disk.

Btw, the 3DXpoint storage (Intel Optane SSDs & Micron X1) should be able to sustain crazy write rates too.


That really depends on your software.

With something like NoSQL, it is kind of built in that it will be distributed. But that pushes the compute cost back onto the clients. Each node is 'crap', but you have hundreds, so it does not matter.

With something like SQL Server, it comes down to how fast you can get the data out of the machine to clone it somewhere else (sharding/hashing, live/live backups, etc). This is disk, network, and CPU, usually in that order.

In most of the ones I ever did, it was almost always the network that was the bottleneck. With something like a 10Gb network card (state-of-the-art neato at the time; I am sure you can buy better now) you were looking at saturation of about 1GB per second (if you were lucky). That is a big number, but depending on your input transaction rate and how the data is stored it can drop off dramatically. Put it local to the server and you can 10x that easily. Going out of node costs a huge amount of latency. Add in a requirement like 'offsite hot backup' and it slows down quickly.

In the 'streaming' world, like Kafka, you end up with a different style: lots of small processes/threads which live on 'meh' machines, but you hash the data and dump it out to other layers for storage of the results. This comes at a cost of more hardware and network. Things like 'does the rack have enough power', 'do we have open ports', 'do we have enough licenses to run at the 10Gb rate on this router', 'how do we configure 100 machines in the same way', 'how do we upgrade 100 machines in our allotted time'. You can fling that out to something like AWS, but that comes at a monetary cost. And even virtual, there is a management cost. Fewer boxes means less cost.


They may rely on the ACID properties of their database. Which makes everything simpler, easier, and safer.

https://dev.mysql.com/doc/refman/8.0/en/mysql-acid.html


I’m guessing that since it is for registration and all, the usage might be write-driven, or at least equally balanced between writes and reads.

In addition, you really care about integrity of your data so you probably want serializability, avoid concurrency and potential write/update conflicts, and to only do the writes on a single server.

For this reason it sounds to me that partitioning/sharding is the only way to really scale this: have different write servers that care about different primary keys.


As stated in the post, there are read replicas. Assuming their workload is primarily reads, this buys them a decent amount of redundancy.


What are they storing on this server that requires 150TB of storage and millions of IOPS?


> What exactly are we doing with these servers? Our CA software, Boulder, uses MySQL-style schemas and queries to manage subscriber accounts and the entire certificate issuance process.


There's nothing in that sentence that implies they'd need even 100 IOPS, much less 20 million.


The post doesn't specify requirements or application level targets for performance. They show a couple of good latency improvements but don't describe the business or technical impact. The closest we get is this.

> If this database isn’t performing well enough, it can cause API errors and timeouts for our subscribers.

What are the SLO's? How was this being met (or not) before vs after the hardware upgrade? There's a lot of additional context that could have been added in this post. It's not a bad post but instead it simply reduces down to this new hardware is faster than our old hardware.


What exactly needs to be stored once the certificate is created and published in the hash tree? It seems like the kind of data that possibly needn't be stored at all, or could be pushed onto something like Glacier for archival.


Going to guess it's for OCSP responses.


FYI, the intermediate CA's signed by their new Root X2 certificate won't have OCSP URLs anymore.

Source: https://letsencrypt.org/2020/09/17/new-root-and-intermediate...


AFAIK, nobody has suggested removal of OCSP from end-entity certificates. This article you linked (and the comment you wrote) is purely about removal from intermediate CA certificates.

The majority of OCSP traffic will probably be for end-entity certificates; most OCSP validation (in browsers and cryptographic libraries) is end-entity validation, not leaf-and-chain.

Removal of intermediate CA's OCSP is probably not really relevant to their overall OCSP performance numbers (and if it was, it was likely cached already).


There's an argument for not doing OCSP on end-entity certificates if you can approach the lifetime for the certificates that you'd realistically need for OCSP responses anyway.

Suppose you promise to issue OCSP revocations within 48 hours if it's urgent, and your OCSP responses are valid for 48 hours. That means after a problem happens OCSP revocation takes up to 96 hours to be effective.

If you only issue certificates with lifetimes of 96 hours then OCSP didn't add anything valuable - the certificates expire before they can effectively be revoked anyway.
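
As a tiny sanity check of that arithmetic (hypothetical numbers from the example above):

    # Revocation only adds value if a revoked certificate would still be
    # trusted after the worst-case OCSP propagation delay.
    revocation_deadline_h = 48  # promised time to publish a "revoked" response
    response_validity_h = 48    # how long an already-fetched "good" response lives
    cert_lifetime_h = 96        # hypothetical very short certificate lifetime

    worst_case_h = revocation_deadline_h + response_validity_h  # 96 hours

    if cert_lifetime_h <= worst_case_h:
        print("cert expires before revocation can take effect; OCSP adds nothing")
    else:
        print("revocation still matters at this lifetime")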

Let's Encrypt is much closer to this idea (90 days) than many issuers were when it started (offering typically 1-3 years) but not quite close enough to argue revocation isn't valuable. However, the automation Let's Encrypt strongly encourages makes shortening lifetimes practical. Many of us have Let's Encrypt certs automated enough that if they renewed every 48 hours instead of every 60 days we'd barely care.

The solution to excessive OCSP traffic and privacy risk is supposed to be OCSP stapling instead, but TLS servers that can't get stapling right are still ridiculously popular so that hasn't gone so well.


I'm not sure; e.g. Chrome doesn't do OCSP by default, and lots of embedded clients like curl won't either. Unless the protocol is terribly broken, it also seems like the kind of use case where 99% of queries just come out of a cache and should never hit a database.


Let's Encrypt still has to publish OCSP responses for every non-expired leaf certificate, at least often enough that you can always get a new OCSP response before the previous one expires. In practice they work to a tighter schedule, so that there's a period between "We are not meeting our self-imposed deadline" and "The Internet broke, oops" in which staff can figure out the problem and fix it.

To do this they automatically generate and sign OCSP responses (the vast majority of which will just say the certificate is still good) on a periodic cycle, and then they deliver them in bulk to a CDN. The CDN is who your client (or server if you do OCSP stapling, which you ideally should) talks to when checking OCSP.

To generate those responses they need a way (hey, a database) to get the set of all certificates which have not yet expired and whether those certificates are revoked or not.
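
For the curious, the whole cycle is conceptually just a scan plus a sign-and-upload per row. A minimal Python sketch with an invented schema (not Boulder's real tables) and the signing/CDN steps stubbed out:

    # Periodic OCSP regeneration, roughly as described above. Table and column
    # names are invented; signing and CDN upload are stubs.
    import pymysql

    def sign_ocsp_response(serial: str, revoked: bool) -> bytes:
        # Stub: in reality an HSM-backed signer produces a real OCSP response.
        return f"{serial}:{'revoked' if revoked else 'good'}".encode()

    def push_to_cdn(serial: str, response: bytes) -> None:
        # Stub: in reality the signed response goes to the CDN origin.
        pass

    def regenerate_all(conn) -> None:
        with conn.cursor() as cur:
            # Every certificate that has not yet expired needs a fresh response.
            cur.execute("SELECT serial, revoked FROM certificates WHERE not_after > NOW()")
            for serial, revoked in cur:
                push_to_cdn(serial, sign_ocsp_response(serial, bool(revoked)))

    if __name__ == "__main__":
        regenerate_all(pymysql.connect(host="db.example.internal", user="ocsp",
                                       password="secret", database="ca"))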


My thoughts exactly ... They are creating and storing text. The only thing I can think of is that they don't actually need the storage, but just want the lowest possible latency by having a large number of drives.


Well, if latency were an issue, or something they'd like to lower, they should have chosen Optane.


Let's say they maxed out their former storage capacity of 24 x 3.8 TB. For the 235 million websites they claim to serve, that works out to around 400 kB per website.
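
Quick back-of-the-envelope check of that figure (assuming the old 3.8 TB drives were completely full):

    # Old array fully used, divided across the claimed number of websites.
    drives, capacity_tb, websites = 24, 3.8, 235e6
    bytes_per_site = drives * capacity_tb * 1e12 / websites
    print(f"{bytes_per_site / 1e3:.0f} kB per website")  # ~388 kB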


>Intel® SSD DC P4610

Interesting that they decided to put in PCIe 3.0 NVMe SSDs instead of PCIe 4.0.

Imagine having 24x Intel Optane [1]. PCIe 5.0 is actually just around the corner. I would imagine that next time Let's Encrypt could upgrade again and continue to use a single DB machine to serve everything.

[1] https://www.servethehome.com/new-intel-optane-p5800x-100-dwp...


Dirty secret when it comes to current PCIe 4.0 drives: they have optimised sequential RW speed so much that they have forgotten about random RW speed (which is the main driver in databases).


I'm curious why they didn't go with the larger 64-core Epyc. I mean, it's double the cost, but I suspect that the huge number of NVMe SSDs is by far the largest part of the cost anyway. And it seems like the CPU was the previous bottleneck, as it was at 90%.


We didn't go with the 64-core chips because they have significantly lower clock speeds.

Dual 32-core chips give us plenty of cores while keeping clocks higher for single-threaded performance.

You are correct that the price of the CPUs is almost irrelevant to the overall cost of a system with this much memory and storage. We were picking the ideal CPU, not selecting on CPU price.


Thanks for the answer. I would have guessed that the higher core count outweighs the lower frequency for database usage, but obviously I don't know the details. I think the 90% CPU usage graph just made me nervous enough to want the biggest possible CPU in there.


Out of interest, how is the NUMA on that setup? Does it still use QPI, or is there a newer technology now? (I've been out of this space for a few years.)


This is a very high-level overview; ideally I would have liked to see more application-level profiling, e.g. where time is being spent within the DB (be it on CPU or IO), rather than high-level system stats. For example, the following.

> CPU usage (from /proc/stat) averaged over 90%

Leaves me wondering exactly which metric from /proc/stat they are referring to. It's presumably user time, but I just dislike attempts to distill systems performance into a few graph comparisons. In reality, the realized performance of a system is often better described by a narrative explaining what bottlenecks the system.
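
For anyone wondering, the aggregate "cpu" line in /proc/stat is just cumulative jiffies per state, and a usage percentage is usually derived from two samples. A rough sketch of the conventional calculation (not necessarily exactly what their graphs plot):

    # Sample /proc/stat twice and report how much of the delta was not idle.
    # Field order on the "cpu" line: user, nice, system, idle, iowait, irq,
    # softirq, steal, ...
    import time

    def cpu_jiffies():
        with open("/proc/stat") as f:
            values = [int(v) for v in f.readline().split()[1:]]  # skip "cpu" label
        idle = values[3] + values[4]  # idle + iowait
        return idle, sum(values)

    idle1, total1 = cpu_jiffies()
    time.sleep(1)
    idle2, total2 = cpu_jiffies()

    busy = 100.0 * (1 - (idle2 - idle1) / (total2 - total1))
    print(f"CPU usage over the last second: {busy:.1f}%")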


I’d like to understand more about the workload here. Queries per second, average result size, database size, query complexity etc.


Unless I misunderstood something, it seems they have a single primary that handles read+write and multiple read replicas for it.

Given the current use of MariaDB, it shouldn't be too difficult to start using something like Galera to create a multi-master cluster and improve the redundancy of the service, unless there are some non-obvious reasons why they aren't doing this.

I think I also see redundant PSUs; it would be neat to know if they're connected to different PDUs and if the networking is also redundant.


That's still a very common pattern if you need maximum performance and can tolerate small periods of downtime. When designing systems, you have to accept some drawbacks. You can forgo a clustered database if you have a strong on-call schedule and redundancy built into other parts of your infrastructure.

Galera is great, but you lose some functionality with transactions and locking that could be a deal-breaker. And up until MySQL 8, there were some fairly significant barriers to automation and clustering that could be a turn-off for some people.

Everything has its pros and cons.


Multi-master hardly comes for free in terms of complexity or performance; you're at the mercy of latency. Either host the second master in the same building, in which case the redundancy is an illusion, or host it somewhere else, in which case watch your write rate tank.

Asynchronous streaming to a truly redundant second site often makes more sense.


How well would two sites in the same city with fiber between them work?


"servers" in title but it seems that it's single. Is it correct? (Just curious, I'm non native for English)


One primary, which they replicate to other servers in other locations.

https://github.com/letsencrypt/openzfs-nvme-databases


>> We currently use MariaDB, with the InnoDB database engine.

It is kind of funny how long InnoDB was the most reliable storage engine. I am not sure if MyISAM is still trying to catch up; it used to be much worse than InnoDB. With the emergence of RocksDB there are multiple options today.


The only thing I ever used MyISAM tables for was storing blobs and full-text search on documents. If your data is mostly read-only, then it's a decent option out of the box. But if you do even mildly frequent updates, you'll quickly run into problems with its table-level locking instead of the row-level locking offered by InnoDB.


MyISAM's days are gone; no one will seriously consider it a suitable engine in MySQL.


It is quite likely, unless you've meticulously avoided it, that MySQL is using MyISAM on-disk temp tables in the service of your queries.


Actually, that shouldn't be the case since 5.7: https://dev.mysql.com/doc/refman/5.7/en/server-system-variab... (and related: https://dev.mysql.com/doc/refman/5.7/en/server-system-variab...)

And in 8.0, MyISAM is mostly gone; not even the 'mysql' schema uses it.

Edit: Originally I linked only to default_tmp_storage_engine.
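
If you want to check what your own server does, the relevant variables are easy to inspect; a quick PyMySQL sketch (connection details are placeholders):

    # Show which engines the server uses for internal temporary tables.
    # On MySQL 5.7 the on-disk default is already InnoDB
    # (internal_tmp_disk_storage_engine); connection details are placeholders.
    import pymysql

    conn = pymysql.connect(host="127.0.0.1", user="root", password="secret")
    with conn.cursor() as cur:
        cur.execute("SHOW VARIABLES LIKE '%tmp%storage_engine%'")
        for name, value in cur.fetchall():
            print(name, "=", value)
    conn.close()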


I didn't know that, thanks!


Surprised something that key to the internet runs off a single server.


@jaas I wonder how many servers you have and in particular how a failure of a system would play out in your situation.

The latter is always interesting to me: building things is easy. Building rock-solid, reliable things is seriously hard work.

Would make a great blog post....


What form factor are those NVMe drives, and how are they connected? I see cables, so I'm assuming they're not all plugged straight into their own PCIe slot. Are there a bunch of M.2 headers on the motherboard?


U.2. The drives have PCIe lanes connected via the blue cables to the motherboard, which is then wired to the CPU. Here's a review that goes into details on the hardware: https://www.servethehome.com/dell-emc-poweredge-r7525-review...


Almost definitely U.2 drives that slot into the front of the server. Here's a cartoony view of what that looks like (on a P5800X Optane SSD): https://www.servethehome.com/new-intel-optane-p5800x-100-dwp...


It would be interesting to see your database schema, Let's Encrypt!


Let's, for the sake of argument, assume that Let's Encrypt is a malicious actor. Can they easily compromise the security of the websites using their certificates?


Yes. Not just websites using their own certificates, though. As a certificate authority, they can create certificates for arbitrary domains. There are, however, a few countermeasures against illegitimate certificates, such as certificate pinning and certificate transparency.


Yes, but they'd better not get caught. It's a trust-based model. It's actually Internet-stack developers/packagers (anything from protocols in the OS or libraries to browsers and devices) that trust Let's Encrypt, among other certificate authorities.


+10% for the proper use of decimated


You mean, that one? https://en.wikipedia.org/wiki/Decimation_(Roman_army)

Then it would be 90 ms -> 81 ms, not 90 ms -> 9 ms. The way I see it, at least. With proper decimation, 90% of what was there remains. ("removal of a tenth", as wikipedia puts it).


Oh man, you’re totally right. I ... uh, blame somebody else.


So they've completely ignored the lesson Google taught us all 20+ years ago -- lots of cheap servers scale better than an ever-more-expensive big-iron SPOF setup -- which we're now all using to scale the internet. It's unbelievable to see a design like this in 2021.

What a stellar bit of incompetence.


I wonder how much they save by not going public cloud?


So scaling up instead of scaling out. I'm not sure it's a viable strategy long term; at the same time, we probably don't want a single CA to handle too many certificates?


Scaling up means each query is faster (3x in this particular case). Scaling out means they can support more clients/domains (more DB shards, more web servers, more concurrency, etc).

These are two distinct axes that are not incompatible with each other.


Not always true, though; scaling out can make your queries faster by alleviating the load.


I'm not aware of any other CA giving out free certificates to anyone. I know that some other providers/hosts will do free certificates, but only to their users (last time I checked).


those EPYC CPUs are really epic


gets huge server. does not properly resize the jpeg on the page (it's 5 megs in size and i see it loading). we don't all have 5TB of ram you know


Fixed!


:)


I'm guessing someone out there's thinking: Why aren't they hosting in the cloud? The cloud being either Amazon or Azure. Surely nothing else exists. Is it really possible to host your own PHYSICAL machine? Does that count as the cloud?!


First, this made me giggle because I run into that attitude all the time. "You're hosting things on a SERVER? Why would anyone do THAT? Heck, you should be putting everything in serverless and avoiding even the vague possibility that you would have to touch anything so degrading and low-class as an operating system. Systems administration? Who does that?"

In all seriousness, however, the decision (likely) has very little to do with that. They're most likely not hosting in the cloud because the current CA/Browser Forum rules around the operation of public CAs effectively don't permit cloud hosting. That's a work in progress, but for the time being, the actual CA infrastructure can't be hosted in the cloud due to security and auditability requirements.


For a service like letsencrypt, the independence factor is also a major reason for self-hosting.

I can foresee letsencrypt in the future building their own cloud (on their own physical infrastructure), but speaking as a user of their free certificate program, I would lose respect for and interest in their service if they went with an AWS or GCP or Azure approach.

The independence from other major players (and the ability of their team to change and move everything about their service, as needed) is one of the reasons I use letsencrypt.


Funny you mention AWS as they're one of the corporate sponsors of LE.

So long as they don't have a viable independent revenue stream they're arguably less independent than commercial CAs.


"one of" being the key point here. Let's encrypt has a huge number of sponsors (AWS being only 1 of 9 even if you only count the "platinum level" sponsors), which should allow them to maintain their independence.

https://letsencrypt.org/sponsors/


Yes, it's still possible to put your own physical machines in a datacentre.

For example : https://www.scaleway.com/en/dedibox/dedirack/

I'm not sure you can say it's the cloud though.

They are not hosting their database in a cloud like Amazon or Azure because no cloud provider offers this level of performance at a comparable price. Actually, I'm not even sure you can get a cloud VM with this many IOPS, even if you don't mind the pricing.


We have a server with the hostname "cloud.example.com".

It can help if someone wants their data "in the cloud".


They are a public CA and must undergo a third-party compliance audit to operate. The conditions are such that you cannot really pass if your infra is in any of those public clouds.


Not sure if you're being sarcastic, but I reckon anything similar would run a couple of mil a year on AWS, vs. a one-off 200k; not a bad saving.


And a sizable internet bill I'd assume. This puppy ain't running on Gigabit.



