I was, long ago, an old-school Unix sysadmin. While I was technically aware of how powerful smallish servers have become, this article really crystallized that for me.
64 cores and 24 NVME drives in a 2U spot on a rack is just insane compared to what we used to have to do to get a beefy database server. And it's not some exotic thing, just a popular mainstream Dell SKU.
If you price it out on Dell's site, you get a retail price north of $200k. That is really what made it clear for me: that you could fit $200k+ worth of DIMMs, drives, and CPUs into a 2U spot :)
If you buy through a VAR and/or Dell reps you don't pay the price on the website. What you actually pay is typically significantly lower. I don't think anyone actually buys servers like these by just ordering from the website. We (Let's Encrypt) certainly don't.
These are expensive servers, crossing into six digits, but not $200k.
I wonder if such “price on application” behaviour is partially what drives people to services like AWS.
I’m looking at buying some FortiGate firewalls to do some NAT; it looks like 200Fs will be fine, but even the FortiGate sales guys won’t give me a price. I have to go to resellers, who also refuse to give prices, which adds friction. Cisco is exactly the same.
When looking at options, price is at the forefront of my mind, but sales guys want me to choose their company, and even commit to the specific device and numbers, before I even see the price.
I almost walked away from the FortiGate option until I found avfirewalls, which gave me a ballpark idea of what I could afford to implement and what the trade-offs were. The benefits of the FortiGate over a MikroTik were worth it at that price, but it was painful getting the price out of them, and they nearly lost the sale because I had assumed it would be 10 times more.
- Hospitals have an interest in massively inflated sticker prices, both to make as much money as possible from those who pay in cash (e.g. foreigners) and to make the debts look better to collection agencies. Basically, assume the true cost of a procedure is $1,000: if the hospital bills $1,000 and the debt goes to a collections agency that buys it for 10% of face value, the hospital gets $100; if the hospital bills $10,000 and the agency pays 10%, the hospital gets $1,000, i.e. the actual cost.
- Insurers have an interest in high sticker prices because they will negotiate with the hospital to pay true cost plus some markup anyway; with a higher sticker price, they can claim that their insurance saves buyers a higher percentage.
- Employers have an interest in high sticker prices because they can market themselves as employers who provide a better health insurance than the competition
The people losing out in this gamble are those who cannot afford insurance and have to declare medical bankruptcy.
And in the case of hardware, or even some "contact sales for a quote" SaaS, it's the same end result: the ones who lose out are small businesses (which can't reach the purchase volume needed to qualify for the cheap-ish rates), while the big companies with dedicated account managers have a nice life.
This is Anchoring, a cognitive bias where an individual depends too heavily on an initial piece of information offered to make subsequent judgments during decision making.
At this size/scale I wonder what the actual price would be for a comparable Supermicro system. If I had to make a wild guess, well under half the 200k previously quoted.
After 15 seconds of googling, the 6.4 TB U.2-format P4610 seems to have a single-unit street price of more like $2,400 from non-Dell vendors. I'm mildly surprised it's that low, considering that U.2-format drives for serious servers will always command a premium price.
Probably in the range of $2100 to $2200 per unit from a x86-64 component distributor in moderate quantities.
It looks like performance-wise I can compare it to a couple of 970 EVOs stuck together, and those are $160/TB retail. So from that point of view you're paying more than double.
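For anyone checking that, a quick back-of-the-envelope in Python using the street prices quoted in this thread (these are the assumptions pulled from the comments above, not current quotes):

```python
# Rough $/TB comparison using the numbers quoted above.
p4610_price_usd, p4610_tb = 2400, 6.4   # Intel P4610 U.2, ~$2,400 street (per the comment above)
evo_usd_per_tb = 160                    # Samsung 970 EVO, ~$160/TB retail

p4610_usd_per_tb = p4610_price_usd / p4610_tb
print(f"P4610: ${p4610_usd_per_tb:.0f}/TB")                             # ~$375/TB
print(f"premium vs 970 EVO: {p4610_usd_per_tb / evo_usd_per_tb:.1f}x")  # ~2.3x, i.e. "more than double"
```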
Well, now I have to ask an old boss if he called a rep. He didn't spend quite $100k+, but we did get a single server from Dell for about twenty grand half a decade back or more, and now I'm curious. I don't think NVMe drives were mainstream at the time, or anywhere near affordable. I suspect he did buy it directly off the site, as he has done for other hardware we got (mainly workstations).
In a way you could say the ones not calling a rep are subsidizing your cost for the server.
One of the interesting things about Threadripper and EPYC systems with a lot of PCIe 4.0 lanes on one motherboard: if your goal is not NVMe disk I/O but rather network throughput, they also provide the opportunity for a DIY approach to very high-capacity routers with multiple 100GbE interfaces.
For example, a fairly low-cost 4U Threadripper box running VyOS (ultimately based on a very normal Debian Linux) with multiple Intel 100GbE NICs in PCIe slots.
Intel's Linux kernel (and FreeBSD) driver support for those is generally excellent and robust. Intel has had full-time developers working on the Linux drivers for that series of cards since the earliest days of their 64-bit/66MHz PCI-X 10Gbps NICs, about 17 years ago.
Also worth mentioning that FRR is now an official Linux Foundation supported project.
AltaVista used to run its entire search engine on a single computer twenty years ago. Imagine what Mike Burrows could accomplish today with one of these 2U EPYC NVMe 200Gbps machines. It'd probably trigger the singularity.
For comparison, you could get a 64-processor, 256 GB RAM Sun Starfire around 20 years ago[1]. Wikipedia claims these cost well over a million dollars ($1.5 million in 2021 dollars). This machine was enormous (bigger than a rack), would have had more non-uniform memory access to deal with, and the processors were clocked at something like 250-650MHz.
That's a pretty good comparison. It weighed 2000lbs, and was 38 inches wide, or basically 2 full racks, which I guess you could call an 84U server. It was also 49 inches deep, versus a standard rack which is 36 inches deep.
I have a motherboard from 2012 and I just put 2x 8TB NVMe SSDs on it, on a PCIe 2.0 x16 slot
Works great. The PCIe card itself has 2 more slots for SSDs
The GPU is on the 2.0 x8 slot because they don't really transfer that much data over the lanes.
I honestly didn't realize PCIe was up to 4.0 now. I am pushing up against the limits of PCIe 2.0, but it still works! And it's only a limit when I want to go faster than 3,000 megabytes per second, which is amazing.
Granted, this would have been considered a good enthusiast motherboard in 2012. Buying new but cheap is the mistake.
I have a hunch that the PCIe card itself is what matters most, as it is doing the bifurcation.
So each drive acts like it has its own slower (but fast enough) PCIe link, and then the RAID 0 stripes them back together to double the performance.
Could be wrong but I get 2,900 megabytes per second transfers from RAM to disk and back.
And this is PCIe 2.0 x16
So maybe if you want 3,000, 4,000 or 6,500 megabytes per second, then I have nothing to brag about. I'm pretty amazed though, and it will be enough for all my use cases.
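For anyone sanity-checking those numbers, the approximate usable per-lane bandwidth of each PCIe generation works out as below (a rough sketch; real-world results land somewhat lower after protocol overhead):

```python
# Usable bandwidth per PCIe lane after line encoding (approximate).
# PCIe 2.0: 5 GT/s, 8b/10b     -> ~500 MB/s per lane
# PCIe 3.0: 8 GT/s, 128b/130b  -> ~985 MB/s per lane
# PCIe 4.0: 16 GT/s, 128b/130b -> ~1969 MB/s per lane
lane_mb_s = {"2.0": 500, "3.0": 985, "4.0": 1969}

for gen, per_lane in lane_mb_s.items():
    print(f"PCIe {gen}: x4 ≈ {4 * per_lane} MB/s, x16 ≈ {16 * per_lane} MB/s")

# Two drives striped behind a bifurcated PCIe 2.0 x16 card get x4 each:
# 2 * 4 * 500 = 4,000 MB/s theoretical, so ~2,900 MB/s observed is in the right ballpark.
```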
STH recently did a video about this dense mass of storage & compute they call a server. iirc they specced it with their discounts to end up at a much more reasonable ~$90k.
We profiled Supermicro vs. Dell vs. HP vs. IBM servers a decade ago for trading applications.
Supermicro was cheaper by 20-30%, had a quantitatively better power profile (which basically meant it paid for itself), and just rocked on every dimension. And their sales reps were able to answer every question we had, including sending us an Excel model of the power usage. The Dell folks never got back to us on power usage.
I'm surprised that's what did it for me. I admined some Sun Enterprise servers and was blown away when I bought a Raspberry Pi a few years ago and realized the Pi was probably faster than they were. That enterprise server would have 100-200 engineering students logged in and working at once. :-/ We're living in the future.
If we're talking about 90s or early 2000s Sun, probably. Even then, though, these systems probably had substantially better I/O performance.
I have a SPARC laptop lying around from 1995 that gets a whopping 20 MB/s in disk read/write speeds across its dual half height SCSI drives. That still beats all but the best SD cards.
The full size systems with SAS and other options would get even better disk performance.
I think the first servers I had in production had 8 MB RAM. No more, certainly. Soon we'll be at 1000x that. My dad's first "server" was 3 orders of magnitude smaller, with 8 KB of RAM (hand-wound wire core memory). In that time, the US population hasn't even doubled.
We've made so much progress in that field over the last 40 years... I sometimes get lost trying to imagine what computing performance will be available 100, 1,000, or 10,000 years from now...
That doesn't seem like some fundamental restriction, but rather a compromise we're always willing to make. 30ms is not noticeable, so we don't try to lower it and sacrifice something else.
Even the super slow ones, like on a Kindle, are _choices_ that have been made in favor of something else. A second to turn a page in a book isn't unbearable.
30ms latency is acceptable. But that's the best we ever had.
High-end Android phones have 120+ ms latency. That's easily noticeable and actually annoying (at least to me).
My personal pet peeve is the input latency when I'm starting new applications in KDE. E.g. I'll start a new terminal with Super+Enter, followed by Super+Right Arrow in order to tile it to the right. But the latency is big enough that often it's not the terminal that ends up tiled, but the application that had focus earlier, e.g. a web browser. It's really annoying.
I also don't understand how still in 2020 when I move a window with the mouse, the window can't keep up with the mouse.
30ms input lag is what we should work towards. But that's not what we have today. Today is actually crap.
It’s so frustrating that our phones and computers can’t move at the speed our mind moves. I know it could if someone cared to try to make it a priority.
I think that if you had a top-to-bottom approach to a Linux distro designed entirely for latency of user interaction, you'd get something that would keep up with you.
I keep meaning to try out WindowMaker again as my Fedora window manager. I feel like it would be incredibly speedy on today's hardware.
The one thing that I get hung up on when it comes to RAID and SSDs is the wear pattern vs. HDDs. Take for example this quote from the README.md:
We use RAID-1+0, in order to achieve the best possible performance without being vulnerable to a single-drive failure.
Failure on SSDs is predictable and usually expressed with Terabytes Written (TBW). Failure on spinning disk HDDs is comparatively random. In my mind, it makes sense to mirror SSD-based vdevs only for performance reasons and not for data integrity. The reason is that the mirrors are expected to fail after the same amount of TBW, and thus the availability/redundancy guarantee of mirroring is relatively unreliable.
Maybe someone with more experience in this area can change my mind, but if it were up to me, I would have configured the mirror drives as spares, and relied on a local HDD-based zpool for quick backup/restore capability. I imagine that would be a better solution, although it probably wouldn't have fit into tryingq's ideal 2U space.
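To put the TBW argument into numbers, here's a rough endurance estimate; the capacity, drive-writes-per-day rating, and workload below are illustrative assumptions, not specs for any particular drive:

```python
# Illustrative endurance math for a mirrored pair of identical SSDs.
capacity_tb = 6.4        # assumed drive capacity
dwpd = 3                 # assumed endurance class (drive writes per day over the warranty)
warranty_years = 5
daily_writes_tb = 2.0    # assumed write volume seen by each mirror member

tbw_rating = capacity_tb * dwpd * 365 * warranty_years
years_to_rated_wearout = tbw_rating / (daily_writes_tb * 365)

print(f"rated endurance ≈ {tbw_rating:,.0f} TBW")
print(f"≈ {years_to_rated_wearout:.0f} years at {daily_writes_tb} TB/day")
# Both halves of a mirror see essentially identical writes, which is the concern above:
# they approach their TBW rating together, unlike mechanically independent HDD failures.
```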
> Failure on SSDs is predictable and usually expressed with Terabytes Written (TBW). Failure on spinning disk HDDs is comparatively random.
That wasn't my experience with thousands of SSDs and spinning drives. Spinning drives failed more often, but usually with SMART sector counts increasing beforehand. Our SSDs never got close to media wearout, but that didn't stop them from dropping off the bus. Literally working fine, then boom, can't detect; all data gone.
Then there are the incidents where the power-on-hours value rolls over and kills the firmware. I believe these have happened on disks of all types, but here's a recent one on SSDs [1]. Normally when building a big server, all the disks are installed and powered on at the same time, which risks catastrophic failure in case of a firmware bug like this. If you can, try to get drives from different batches, and stagger the power-on times.
> Failure on spinning disk HDDs is comparatively random.
comparatively, yes, but when averaged out over a large number of hard drives it definitely tends to follow a typical bathtub curve failure model seen in any mechanical product with moving parts.
State of the art systems keep ~1.2 copies (e.g. 10+2 raid 6) on SSD, and an offsite backup or two. The bandwidth required for timely rebuilds is usually the bottleneck.
These systems can be ridiculously dense; a few petabytes easily fits in 10U. With that many NAND packages, drive failures are common.
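The rebuild-bandwidth point is easy to put into numbers (the figures here are hypothetical):

```python
# How long a rebuild takes when only part of a drive's bandwidth can be spent on it.
drive_tb = 6.4          # assumed capacity of the failed drive
rebuild_mb_s = 500      # assumed bandwidth reserved for rebuild while serving live traffic

hours = drive_tb * 1_000_000 / rebuild_mb_s / 3600
print(f"rebuilding {drive_tb} TB at {rebuild_mb_s} MB/s ≈ {hours:.1f} hours")
# ≈ 3.6 hours here, and with a wide stripe (e.g. 10+2) every surviving member must be
# read as well, so the array spends that whole window degraded and under extra load.
```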
An early mitigation strategy has been to use different brands of SSD so that failure rates get more spread out. For RAID 6 we have started out with a maximum of 2 drives of a single brand.
The result for us was 2 drives that failed within the same month, of the same brand, and from there it seems to be single failures rather than clusters.
The tuning is almost identical to what we have in production. A few comments:
* I would probably go with ashift=12 to get a better compression ratio, or even go as far as ashift=9 if the disks can sustainably maintain the same performance. Benchmark first, of course. (There's a sketch of the allocation math after this list.)
* We came to the same conclusion regarding AIO recently, but just today I did more benchmarking, and it looks like the ZFS shim does perform better than the InnoDB shim. So I think it's still fine to enable innodb_use_native_aio.
* We use zstd compression from ZFS 2.0. It's great, and we all deserve it after suffering through the PR dramas.
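The ashift point in the first bullet comes down to allocation granularity: with compression on, ZFS stores each record rounded up to a multiple of 2^ashift bytes, so a smaller ashift wastes less space per compressed record. A quick sketch (the post-compression sizes are made up for illustration):

```python
import math

def allocated(compressed_bytes: int, ashift: int) -> int:
    """Bytes ZFS actually allocates: the compressed size rounded up to 2**ashift."""
    sector = 2 ** ashift
    return math.ceil(compressed_bytes / sector) * sector

# A 16 KiB InnoDB page compressed down to various sizes (hypothetical):
for compressed in (5_000, 9_000, 13_000):
    for ashift in (9, 12, 13):
        print(f"compressed={compressed:>6}B  ashift={ashift:>2}  "
              f"allocated={allocated(compressed, ashift):>6}B")
# A 9,000-byte record occupies 16 KiB with ashift=13 (no savings at all),
# 12 KiB with ashift=12, and 9,216 bytes with ashift=9.
```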
Based on their stated 225M sites and a renewal period of 90 days, they're probably averaging around 40 certificates per second. That's only an order of magnitude higher than bitcoin; I wouldn't call it an indication of an ability to scale to a particularly large amount of traffic.
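The arithmetic behind that estimate, for anyone who wants to check it (certbot typically renews at around 60 days even though the certificates are valid for 90):

```python
sites = 225e6                   # active certificates, per the article
for renew_days in (90, 60):     # 90-day lifetime vs. the ~60-day renewal cadence certbot uses
    per_second = sites / (renew_days * 86_400)
    print(f"renew every {renew_days} days -> ~{per_second:.0f} issuances/second on average")
# ~29/s and ~43/s respectively, so "around 40 certificates per second" is the right ballpark.
```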
They might average that, but we all know averages only work on paper. For example, AWS has tutorials with instructions on how to set up TLS in a LAMP stack running on Amazon Linux 2 EC2 instances. As part of the Let's Encrypt setup, they provide a copy-and-paste crontab entry that runs twice a day. How many EC2 instances all hit the Let's Encrypt servers at that exact time? And since EC2 instances default to UTC, the requests aren't even offset by timezone, which means an even bigger spike.
as the example crontab entry. Though, for maximum friendliness to Let's Encrypt, that should probably be something like "~ ~ * * ~", which would run the command at a random time once per week. I think you could accomplish something similar using systemd timer units and RandomizedDelaySec, assuming it permits a delay as high as 604800 seconds.
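A minimal sketch of the same load-spreading idea without relying on non-standard cron syntax: wrap the renew attempt in a random delay. This assumes certbot is installed and on PATH; the 12-hour window is arbitrary:

```python
import random
import subprocess
import time

# Sleep a random amount within a 12-hour window before attempting renewal, so a
# fleet of identically configured machines doesn't hit the CA at the same instant.
# certbot itself only contacts the CA when a certificate is actually close to expiry.
time.sleep(random.uniform(0, 12 * 3600))
subprocess.run(["certbot", "renew", "--quiet"], check=False)
```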
You probably should not run it only once a week. That means it takes up to a whole week to even attempt to automatically fix a problem, which is a long time. True that's going to be four attempts before, finally, the certificate actually expires (if the problem is just intermittent service availability), but it doesn't seem worth the small additional risk.
Now, sure, in principle you should have active monitoring so you'd know immediately if there's a problem status e.g. certificate with only 14 days left until expiry; revoked, expired or otherwise bad certificate presented by server; OCSP staples missing. But we know lots of people don't have monitoring, and I guarantee at least one person reading this HN thread is mistaken and isn't monitoring everything they thought they were.
Like Certbot, acme-client does a local check before taking any action, so if it is run once a day (or indeed once an hour) it will not normally call out to the Let's Encrypt service.
Unlike Certbot, acme-client doesn't do OCSP checks, so it won't even talk to the network to get an OCSP response. Even with Certbot, OCSP is served through a CDN (so it costs roughly the same as fetching a static image from a popular web site), and so your check is negligible in terms of operating costs for Let's Encrypt.
It's worth noting that running certbot / LE twice a day doesn't actually hit the LE server twice a day. It just checks the certificate dates locally, and if the certificates don't need renewing yet it does nothing.
I guess you still have a peak time of 00:00 UTC every day though unless people are using servers set to their local time.
I believe modern Certbot also does OCSP. The intent here is, if for any reason your certificate was revoked it makes sense to try to obtain a new certificate even if it hasn't expired. Even if it can't, perhaps the operator will notice a logged reason from Certbot when they did not yet notice their certificate was revoked.
Examples of reasons your certificate might have been revoked:
* You re-used a private key from somewhere, perhaps because you don't understand what "private" means, and other copies of that key leaked
* You didn't re-use the private key but your "clever" backup strategy involves putting the private key file in a public directory named /backup/server/ on your web server and somebody found it
* You use the same certificate for dozens of domain names in your SEO business and yesterday you sold a name for $10k. Hooray. The new owner immediately revoked all certificates for that name which they're entitled to do.
* Your tooling is busted and the "random" numbers it picked aren't very random. (e.g. Debian OpenSSL bug)
* A bug at Let's Encrypt means their Ten Blessed Methods implementation was inadequate for some subset of issuances and rather than cross their fingers the team decided to revoke all certificates issued with the inadequate control.
* Let's Encrypt discovers you're actually a country sanctioned by the US State Department, perhaps in some thin disguise such as an "independent" TV station in a country that doesn't have any independent media whatsoever. It is illegal for them to provide you with services and you were supposed to already know that.
So that is a network connection, but not to the Let's Encrypt servers described in this story.
In practice, a big CA does OCSP by periodically computing OCSP responses for every single unexpired certificate (either saying it's still valid, or not) and then providing those to a CDN; the CDN acts as the "OCSP server", returning the appropriate OCSP response when asked, without itself having possession of any cryptographic material.
Yes. They are not doing a very heavy computational workload. Typical heavy-duty servers these days can do 100k's or millions of TPS. 40 TPS is a really, really, really light load.
Further, I was looking at those new server specs. There's an error, I think? The server config on the Dell site shows 2x 8 GB DIMMs, for 16 GB RAM per server, whereas the article says 2 TB!
With only 16GB of RAM, but 153.6 TB of NVMe storage, the real issue here is memory limitation for a general-purpose SQL database or a typical high-availability NoSQL database.
Check my math: 153600 GB storage / 16 GB memory = 9600:1 ratio
Consider, by comparison, that a high-data-volume AWS i3en.24xlarge has 60 TB of NVMe storage but 768 GB of RAM: a 78:1 ratio.
If the article is correct, and the error is in the config on the Dell page (not the blog), and this server is actually 2 TB RAM, then that's another story. That'd make it a ratio of 153600 / 2000 = ~77:1.
Quite in line with the AWS I3en.
But then it would baffle me why you would only get 40 TPS out of such a beast.
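The ratios in question, spelled out:

```python
storage_gb = 24 * 6400      # 153,600 GB of raw NVMe
print(storage_gb / 16)      # 9,600:1 against the 16 GB shown in the Dell configurator
print(60_000 / 768)         # ~78:1 for an AWS i3en.24xlarge (60 TB NVMe, 768 GB RAM)
print(storage_gb / 2_000)   # ~77:1 if the server really has 2 TB of RAM, as the article says
```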
Why are you assuming that their workload includes just one query per emitted certificate?
The reality is that they are storing information during challenges, implementing rate limiting per-account, supporting OCSP validation and a few other things.
You can investigate further if you really want to see the queries that they make against the database since their software (Boulder) is open source [1]. Most queries are in the files in the "sa" (storage authority) folder.
Don’t these certificates have long RSA keys which are more expensive computationally? Though I guess that doesn’t have to happen on the database server.
The only RSA computations Let's Encrypt need to do are:
* Signature by their RSA Intermediate (currently R3, with R4 on hot standby) - which will be a dedicated piece of hardware - to issue a subscriber's certificate. In practice this happens twice, as a poisoned pre-certificate to obtain proof of logging from public log servers, and then the real certificate with the proofs baked inside it.
* Signatures by their OCSP signer periodically on an OCSP response for each certificate saying it's still trustworthy for a fixed period. Again this will be inside an HSM.
* Signature verification on a subscriber's CSR. To "complete the circuit" it's helpful that Let's Encrypt actually confirms you know the private key corresponding to the public key you wanted a certificate for, the signature on your CSR does this. Some people don't think this is necessary, but I believe Let's Encrypt do it anyway.
You're correct that none of this happens on the database servers. I guess it's possible their servers use TLS to secure the MariaDB connections, in which case a small amount of either ECDSA or RSA computation happens each time such a connection is set up or torn down like at any outfit using TLS, but those database connections are cached in a sane system so that wouldn't be very often.
Does "certbot renew" talk to the mothership at all if no certs are ready for renewal? If it does, most setups I've seen run the renewal once or twice a day since it only does the renew when you're down to 30 days left. There may also be some OCSP related traffic.
But the cryptographic bottleneck is in HSMs, not in database servers (database servers don't generate the digital signatures, they just have to store them after they've been generated).
Bitcoin is actually not limited by compute power at all. It's an artificial cap on the transaction rate, to keep the blockchain from growing so large that normal users can't host the whole thing.
You can see that the blockchain size was growing exponentially but then switched to linear growth as we hit the transaction cap; it now sits at about 350 GB.
Read performance is much easier to scale (in one box or several) than write performance. It's usually the writes that make you look at Cassandra and similar, instead of adding more disks and RAM, or spinning another read-only replica.
24 NVMEs should have a lot of write throughput, though.
That's the wrong way to think about the cloud. A better way to think about it would be "how much database traffic (and storage) can I serve from Cloud Whatever for $xxx". Then you need to think about what your realistic effective utilization would be. This server has 153600 GB of raw storage. That kind of storage would cost you $46000 (retail) in Cloud Spanner every month, but I doubt that's the right comparison. The right math would probably be that they have 250 million customers and perhaps 1KB of real information per customer. Now the question becomes why you would ever buy 24×6400 GB of flash memory to store this scale of data.
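Roughly where that $46k/month figure comes from, assuming Cloud Spanner's list price of about $0.30 per GB-month for SSD storage (an assumption, not a quote; check current pricing):

```python
storage_gb = 24 * 6400                 # 153,600 GB of raw NVMe in the Dell box
spanner_usd_per_gb_month = 0.30        # assumed Cloud Spanner SSD storage list price

print(f"${storage_gb * spanner_usd_per_gb_month:,.0f}/month")  # ~$46,080/month

# Versus the actual data footprint if you assume ~1 KB of real information per customer:
customers = 250e6
print(f"~{customers * 1_000 / 1e9:.0f} GB of 'real' data")     # ~250 GB
```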
Checking out the AWS side, the closest I think you'd get is the x1.32xlarge, which would translate to 128 vCPUs (which on Intel generally means 64 physical cores) and close to 2 TB of RAM. NVMe storage is only a paltry 4 TB, so you'd have to make up the rest with EBS volumes. You'd also get a lower clock speed than they are getting out of the EPYCs.
spittakes reading the suggestion of replacing NVMe with EBS
I mean, yeah, I guess you can. But a lot depends on your use case and SLA. If you need to keep ultra-low p99s (single-digit milliseconds), then EBS is not a real option.
But if you don't mind latencies, then yeah, fine.
Don't get me wrong: EBS is great. But it's not a panacea and strikes me as a mismatch for a high performance monster system. If you need NVMe, you need NVMe.
It wasn't a suggestion to replace it; it was saying that even the closest instance still doesn't get you 24 TB of on-host storage, so you'd have to use EBS instead.
This starts going down a rabbit hole of tiered storage then. Because unless you design for that, you can only be as fast as your slowest storage. It'd be like tying a 20 kg weight to the leg of a sprinter otherwise.
Stopping by to say, 9ms API response time is just ridiculously quick. You're starting to run into the laws of physics and client proximity to the datacenter where those machines live. That's a pretty amazing feat. I would assume the next step for scaling is getting those read replicas deployed across the world in order to cut down on RTT.
9ms 50%ile node latency is good, but that number is normally 1ms for big internet services. See https://dl.acm.org/cms/attachment/3658918e-7081-4676-beec-aa... Mission-critical stuff goes even faster; BigTable, for example, has numbers 4x better than that figure.
The 9ms is one indicator that the new hardware platform has ample extra capacity for future growth in load and traffic, it probably won't need to be replaced or upgraded for some years.
> We can clearly see how our old CPUs were reaching their limit. In the week before we upgraded our primary database server, its CPU usage (from /proc/stat) averaged over 90%
This strikes me as odd. In my experience, traditional OLTP row stores are I/O bound due to contention (locking and latching). Does anyone have an explanation for this?
> Once you have a server full of NVMe drives, you have to decide how to manage them. Our previous generation of database servers used hardware RAID in a RAID-10 configuration, but there is no effective hardware RAID for NVMe, so we needed another solution... we got several recommendations for OpenZFS and decided to give it a shot.
Again, traditional OLTP row stores have included a mechanism for recovering from media failure: place the WAL log on a separate device from the DB. Early MySQL used a proprietary backup add-on as a revenue model, so maybe this technique is now obfuscated and/or missing. You may still need/want a mechanism to federate the DB devices, and incremental volume snapshots are far superior to a full DB backup, but placing the WAL log on a separate device is a fantastic technique for both performance and availability.
The Let's Encrypt post does not describe how they implement off-machine and off-site backup-and-recovery. I'd like to know if and how they do this.
> The Let's Encrypt post does not describe how they implement off-machine and off-site backup-and-recovery. I'd like to know if and how they do this.
The section:
> There wasn’t a lot of information out there about how best to set up and optimize OpenZFS for a pool of NVMe drives and a database workload, so we want to share what we learned. You can find detailed information about our setup in this GitHub repository.
> Our primary database server rapidly replicates to two others, including two locations, and is backed up daily. The most business- and compliance-critical data is also logged separately, outside of our database stack. As long as we can maintain durability for long enough to evacuate the primary (write) role to a healthier database server, that is enough.
Which sounds like a traditional master/slave setup, with failover?
> Which sounds like traditional master/slave setup, with fail over?
Yes, thank you. I assumed that the emphasis on the speed of NVMe drives meant that master/slave synchronous replication was avoided and asynchronous replication could not keep up. In my mind, this leaves room for interesting future efficiency/performance gains, especially surrounding the "...and is backed up daily" approach mentioned in your quote.
The bottom line is that the old RPO (Recovery Point Objective) and RTO (Recovery Time Objective) are as important as ever.
> This strikes me as odd. In my experience, traditional OLTP row stores are I/O bound due to contention (locking and latching). Does anyone have an explanation for this?
Let me boil it down to a few points; some beyond Avi's talk:
• Traditional RDBMS with strong consistency and ACID guarantees are always going to exhibit delays. That's what you want them for. Slow, but solid.
• Even many NoSQL databases written (supposedly) for High Availability still use highly synchronous mechanisms internally.
• You need to think about a multi-processor, multi-core server as its own network internally. You need to consider rewriting everything with the fundamental consideration of async processing, even within the same node. Scylla uses C++ futures/promises, shared-nothing shard-per-core architecture, as well as new async methods like io_uring.
• Between nodes, you also have to consider highly async mechanisms, for example the tunable eventual consistency model you'd find in Cassandra or Scylla. We also support Paxos for LWTs if you need strong linearizability or read-before-write conditional updates, but that comes at a cost. Many classes of transactions will treat that as overkill.
• And yes, backups are also a huge issue for those sorts of data volumes. Scylla, for example, has implemented different priority classes for certain types of activities. It handles all the scheduling between OLTP transactions as highest priority, while allowing the system to plug away at, say, backups or repairs.
More on where we're going with all this is written in a blog about our new Project Circe:
But the main point is that you have to really think about how to re-architect your software to take advantage of huge multi-processor machines. If you invest in all this hardware, but your software is limiting your utility of it, you're not getting the full bang you spent your buck on.
> But the main point is that you have to really think about how to re-architect your software to take advantage of huge multi-processor machines.
I appreciate the response but it doesn't address my question: given that Let's Encrypt's MySQL-family RDBMS does not implement any of the multi-core/multi-socket/cpu-affinity/lock-free/asyncIO techniques used by databases like ScyllaDB, MemSQL, and VoltDB, why were they seeing 90% CPU utilization on their old Intel servers while the upgraded AMD servers were 25% (the expected range)?
I think mike_d's suggestion is most plausible: they probably included custom functions/procedures that invoke CPU-expensive code. I also thought this was a single-node scale-up architecture but since they are using a three-or-more node master/slave architecture, network I/O could somehow be involved.
On "cloud" servers we usually see that when disk I/O stops being the bottleneck (given enough disk I/O you can compensate for small amounts of RAM). So I'd guess their disk setup had some capacity left when they upgraded, or they just have a lot of reads and the data fits in memory.
I kind of did answer a different question, but you sort of affirmed my answer and proved my point. If the "expected range" of your CPU utilization is only 25%, then you are knowingly paying for 3x more CPU than you are actually ever planning to use. I suppose if you have that cash money and are willing to burn it, nothing I can say will stop you. I just question the logic of it.
As for why they suddenly dropped? I'll leave that to someone who knows this particular system far better than I do.
I think Jeffbee resolved my confusion; InnoDB uses spin locks that chew up CPU cycles when the request rate exceeds the I/O rate. This is not as bad as it sounds; it uses extra power and generates extra heat, but it is not doing any extra real work. InnoDB cores run at either 25% or 90%, but the 90% indicates that you need an I/O upgrade, not more CPU cycles.
Optimizing for CPU efficiency in a system that is I/O bound will not save significant money. Memory and NVMe are the critical factors and with the master/slave replication used, network I/O significantly undercuts the peak performance this single server is capable of.
> traditional OLTP row stores are I/O bound due to contention (locking and latching). Does anyone have an explanation for this?
I have seen CPU bound database servers when developers push application logic in to the database. Everything from using server-side functions like MD5() to needless triggers and stored procedures that could have been done application side.
Any MySQL with more than about 100 concurrent queries of the same InnoDB table is going to be CPU bound on locks. Their whole locking scheme doesn't scale; it's designed to look great in benchmarks with few clients.
It sounds like you just hit the threshold where defaults don't cut it anymore. With >100 concurrent clients you need to tune your DB for your workload.
innodb_thread_concurrency and innodb_concurrency_tickets would be a good starting point, and optimal values depend on your r/w balance and number of rows touched per type of query.
I'm saying the InnoDB buffer pool mutex doesn't scale past 100 contenders, and you are saying that I can tune MySQL so there are never more than that, so it seems to me like we're in agreement.
I did a quick search and it looks like InnoDB implements a spin-lock. Do you see increased CPU utilization when the buffer pool is overloaded? This could explain the behavior described in the article.
Correct, it uses an aggressive spinlock without fairness features for waiters like you'd see with a proper production-quality mutex. This makes it very efficient in the absence of contention (i.e. in trivial benchmarks) and extremely poorly behaved in times of contention.
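A toy demonstration of why spinning shows up as high CPU without extra useful work getting done. This is not InnoDB's actual implementation, just the general shape of the problem, and Python's GIL makes it an imperfect analogy, but the CPU-time difference is still visible:

```python
import threading
import time

N_THREADS, N_ITERS = 8, 20_000
counter = 0
lock = threading.Lock()

def blocking_worker():
    global counter
    for _ in range(N_ITERS):
        with lock:                            # waiters sleep until the lock is free
            counter += 1

def spinning_worker():
    global counter
    for _ in range(N_ITERS):
        while not lock.acquire(blocking=False):
            pass                              # busy-wait: burns CPU while contended
        try:
            counter += 1
        finally:
            lock.release()

def run(worker):
    global counter
    counter = 0
    threads = [threading.Thread(target=worker) for _ in range(N_THREADS)]
    cpu0, wall0 = time.process_time(), time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.process_time() - cpu0, time.perf_counter() - wall0

for name, worker in (("blocking", blocking_worker), ("spinning", spinning_worker)):
    cpu, wall = run(worker)
    print(f"{name:9s} cpu={cpu:.2f}s wall={wall:.2f}s count={counter}")
# The spinning variant typically racks up far more CPU time for the same amount of work,
# which from the outside looks like "90% CPU but no extra throughput".
```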
I am very interested. Could you provide more detail or a link about this?
I would love to understand what they are doing that looks good in benchmarks but does not actually scale.
Indeed, at $PLACE_OF_WORK over the last 15 years a lot of logic was built in PL/SQL, and database CPU has become one of our most precious resources. For some applications it's perfectly reasonable, until you need to scale horizontally.
I don't understand why they are trying so hard to avoid sharding. It seems to me that this is a perfect example of an "embarrassingly parallel" problem for which sharding would be borderline trivial. What am I missing?
Well, their current solution is still a lot simpler, I suppose. With multiple servers there's a lot of extra administrative stuff you need to deal with.
It's also an example of a "mustn't fail or you break the internet" problem and a "lots of people with nation state resources have a reason to fuck with us" problem. They are prioritising simplicity as a means to security, and that makes sense to me.
It still works like this and they have plenty of headroom with the new solution. Chances are good that when they need to upgrade this solution, technology will have advanced far enough as well. Guessing what their performance demands are (far, far more reads than writes) this seems to work fine for them, so why make it more complicated?
They do talk about read-only replicas, so they have some distribution.
Sharding is like amputating a limb! It works really well if the limb is infected with, say, flesh-eating bacteria! But you keep sharding as a last resort! It's hard to go back.
I'm more interested in how they used ZFS to provide redundancy. I always thought ZFS was optimized for spinning platters, with SSDs used for persistent caching. In this scenario they used it to set up all their SSDs in mirrored pairs and then stripe across those. No separate log device (SLOG).
They've tweaked a few other settings as well [1].
I'd be curious to see more benchmarks and latency data (especially as they're utilizing compression, and of course checksums are computed over all data not just metadata like some other filesystems).
ZFS was started around 2001, when SSDs weren't really a thing. Its goals were, amongst other things, to manage multiple volumes (providing redundancy), to be reliable (most filesystems aim for this), and to support cheap snapshotting. The last two were supposed to come from being copy-on-write, and this model was an advantage when SSDs became popular, as it worked a bit better with their semantics.
What a great read. I think the authors here made great hardware and software decisions. OpenZFS is the way to go, and is so much easier to manage than the legacy RAID controllers imho.
I enjoyed it as well; I'm also appreciative that they shared their configuration notes here. I've been running multiple data stores on ZFS for years now, and it's taken a while to get out of the hardware-RAID mindset (though you still need a nice beefy controller anyway).
Can you explain the advantages of OpenZFS over other filesystems? I know FreeBSD uses ZFS, but I never really understood how it stacks up relative to other technologies...
As someone unfamiliar with db management, is it really less operational overhead to have to physically scale your hardware than using a distributed option with more elastic scalability capabilities?
Relational databases enable some very flexible data access patterns. Once you shard, you lose a lot of that flexibility. If you move away from a relational model, you lose even more flexibility and start having to do much more work in your application layer, and usually start having to use more resources and developer time every step of the way.
The productivity enabled by having one master RDBMS is a big deal, and if they can buy commodity servers that satisfy their requirement, this seems like a fine way to operate.
I agree this is an underappreciated strategy. Someone in my family worked for a hedge fund where one of their simple advantages was that they just ran MS SQL on the biggest physical machine available at any given moment. Lots of complexity dodged by just having a lot of brute capacity.
If I had a billion dollars, I'd put a research group together to study the prospects of index sharding.
That is, full table replication, but individual servers maintaining differing sets of indexes. OLAP and single request transactions could be routed to specialized replicas based on query planning, sending requests to machines that have appropriate indexes, and preferably ones where those indexes are hot.
This has been done in popular commercial databases for decades, and is thoroughly researched. As far as I know, these types of architectures are no longer used at this point due to their relatively poor scalability and write performance. I don't think anyone is designing new databases this way anymore, since it only ever made sense in the context of spinning disk.
The trend has been away from complex specialization of data structures, secondary indexing, etc and toward more general and expressive internal structures (but with more difficult theory and implementation) that can efficiently handle a wider range of data models and workloads. Designers started moving on from btrees and hash tables quite a while ago, mostly for the write performance.
Write performance is critical even for read-only analytical systems due to the size of modern data models. The initial data loading can literally take several months with many popular systems, even for data models that are not particularly large. Loading and indexing 100k records per second is a problem if you have 10T records.
We pick up algorithms from 30 years ago all the time. Nobody’s as bad as the fashion industry, but we sure do try.
Part of it is short memories, but part of it is how the cost inequalities in our hardware shifts back and forth as memory or storage or network speeds fall behind or sprint ahead.
... and even if you somehow solved that, the law of physics hits you hard. Latency can be a real performance killer generally and is doubly true in database-type computing.
Why would that be the case? In this case we have already accepted that multiple servers will be involved. That means the limitations of the networking are a given.
Also worth noting that scalability != efficiency. With enough NVMe drives, a single server can do millions of IOPS and scan data at over 100 GB/s. A single PCIe 4.0 x4 SSD on my machine can do large I/Os at 6.8 GB/s rate, so 16 of them (with 4 x quad SSD adapter cards) in a 2-socket EPYC machine can do over 100 GB/s.
You may need clusters, duplicated systems, replication, etc for resiliency reasons of course, but a single modern machine with lots of memory channels per CPU and PCIe 4.0 can achieve ridiculous throughput...
edit: Here's an example of doing 11M IOPS with 10x Samsung Pro 980 PCIe 4.0 SSDs (it's from an upcoming blog entry):
My mind exploded when realizing we can read random IO from disk at 40GB/s, which is faster than my laptop can read from RAM.
https://spdk.io/news/2019/05/06/nvme/
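A quick sanity check of the aggregate figures being thrown around here, using the per-drive numbers quoted above:

```python
seq_gb_s_per_drive = 6.8   # one PCIe 4.0 x4 SSD doing large sequential reads, per the comment above
drives = 16
print(f"{drives} drives ≈ {drives * seq_gb_s_per_drive:.0f} GB/s aggregate sequential")  # >100 GB/s

total_iops, iops_drives = 11e6, 10
print(f"≈ {total_iops / iops_drives / 1e6:.1f}M random read IOPS per drive")             # ~1.1M each
# Keeping that many queues fed is exactly why the discussion below turns to CPU
# and memory latency rather than the drives themselves.
```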
Btw. using SPDK or io_uring?
Yes, with modern storage, throughput is a CPU problem.
And CPU problem for OLTP databases is largely a memory access latency problem. For columnar analytics & complex calculations it's more about CPU itself.
When doing 1 MB sized I/Os for scanning, my 16c/32t (AMD Ryzen Threadripper Pro WX) CPUs were just about 10% busy. So, with a 64 core single socket ThreadRipper workstation (or 128-core dual socket EPYC server), there should be plenty of horsepower left.
Since I mentioned memory access latency: I just posted my old article series about measuring RAM access performance (using different database workloads) to HN, and it looks like it even made it to the front page (nice):
We've made Scylla as async, shared nothing as possible, and we've also started adding C++20 coroutines (to replace futures/promises). We'll be doing more of that in 2021.
It can burst to millions of IOPS, but you get killed on the sustained write workload. Even a high-end enterprise NVMe drive will be limited to around 60k IOPS once you exceed its write cache.
Yup, indeed it's an issue with NAND SSDs, and it heavily depends on the type (SLC, MLC, TLC, QLC), the vendor (controller, memory), "write buffer" sizes, etc. I'm doing mostly read tests right now and will move on to writes after.
The Samsung PRO 980 drives I have are TLC for main storage, but apparently use some of that TLC storage as a faster write buffer (the TurboWrite buffer). I'm not an expert, but apparently the controller can decide to program the TLC NAND with only 1-bit "depth"; they call it "simulated SLC" or something like that. On the 1 TB SSD, the TurboWrite buffer can dynamically extend to ~100 GB if there's unused NAND space on the disk.
Btw, the 3DXpoint storage (Intel Optane SSDs & Micron X1) should be able to sustain crazy write rates too.
With something like a NoSQL style, it's kind of built in that it will be distributed. But that pushes the compute cost back onto the clients. Each node is 'crap', but you have hundreds, so it does not matter.
With something like SQL Server, it comes down to how fast you can get the data out of the machine to clone it somewhere else (sharding/hashing, live/live backups, etc.). This is disk, network, and CPU, usually in that order.
In most of the ones I ever did, it was almost always the network that was the bottleneck. With something like a 10Gb network card (state-of-the-art neato at the time; I am sure you can buy better now) you were looking at saturation of 1 GB per second (if you were lucky). That is a big number, but depending on your input transaction rate and how the data is stored, it can drop off dramatically. Keep it local to the server and you can 10x that easily. Going out of node costs a huge amount of latency. Add in a requirement like 'offsite hot backup' and it slows down quickly.
In the 'streaming' world, like Kafka, you end up with a different style: lots of small processes/threads which live on 'meh' machines, but you hash the data and dump it out to other layers for storage of the results. This comes at a cost of more hardware and network. Things like 'does the rack have enough power', 'do we have open ports', 'do we have enough licenses to run at the 10Gb rate on this router', 'how do we configure 100 machines in the same way', 'how do we upgrade 100 machines in our allotted time'. You can fling that out to something like AWS, but that comes at a monetary cost. And even virtualized, there is a management cost. Fewer boxes means less cost.
I’m guessing that since it is for registration and all, the usage might be write-driven, or at least equally balanced between writes and reads.
In addition, you really care about the integrity of your data, so you probably want serializability, to avoid concurrency and potential write/update conflicts, and to only do the writes on a single server.
For this reason it sounds to me that partitioning/sharding is the only way to really scale this: have different write servers that care about different primary keys.
> What exactly are we doing with these servers?
Our CA software, Boulder, uses MySQL-style schemas and queries to manage subscriber accounts and the entire certificate issuance process.
The post doesn't specify requirements or application-level targets for performance. They show a couple of good latency improvements but don't describe the business or technical impact. The closest we get is this:
> If this database isn’t performing well enough, it can cause API errors and timeouts for our subscribers.
What are the SLOs? How were they being met (or not) before vs. after the hardware upgrade? There's a lot of additional context that could have been added to this post. It's not a bad post, but it simply reduces down to "this new hardware is faster than our old hardware".
What exactly needs to be stored once the certificate is created and published in the hash tree? It seems like the kind of data that possibly needn't be stored at all, or could be pushed to something like Glacier for archival.
AFAIK, nobody has suggested removal of OCSP from end-entity certificates. This article you linked (and the comment you wrote) is purely about removal from intermediate CA certificates.
The majority of OCSP traffic will probably be for end-entity certificates; most OCSP validation (in browsers and cryptographic libraries) is end-entity validation, not leaf-and-chain.
Removal of intermediate CA's OCSP is probably not really relevant to their overall OCSP performance numbers (and if it was, it was likely cached already).
There's an argument for not doing OCSP on end-entity certificates if you can get certificate lifetimes down close to the response validity window you'd realistically need for OCSP anyway.
Suppose you promise to issue OCSP revocations within 48 hours if it's urgent, and your OCSP responses are valid for 48 hours. That means after a problem happens OCSP revocation takes up to 96 hours to be effective.
If you only issue certificates with lifetimes of 96 hours then OCSP didn't add anything valuable - the certificates expire before they can effectively be revoked anyway.
Let's Encrypt is much closer to this idea (90 days) than many issuers were when it started (offering typically 1-3 years) but not quite close enough to argue revocation isn't valuable. However, the automation Let's Encrypt strongly encourages makes shortening lifetimes practical. Many of us have Let's Encrypt certs automated enough that if they renewed every 48 hours instead of every 60 days we'd barely care.
The solution to excessive OCSP traffic and privacy risk is supposed to be OCSP stapling instead, but TLS servers that can't get stapling right are still ridiculously popular so that hasn't gone so well.
I'm not sure, e.g. Chrome doesn't do OCSP by default, lots of embedded clients like curl won't either. Unless the protocol is terribly broken, that also seems like the kind of use case where 99% of queries just come out of cache and should never hit a database.
Let's Encrypt still has to publish OCSP responses for every non-expired leaf certificate, at least in time that you can always get a new OCSP response before the previous one expires. In practice they have a tighter schedule so that there's a period between "We are not meeting our self-imposed deadline" and "The Internet broke, oops" in which staff can figure out the problem and fix it.
To do this they automatically generate and sign OCSP responses (the vast majority of which will just say the certificate is still good) on a periodic cycle, and then they deliver them in bulk to a CDN. The CDN is who your client (or server if you do OCSP stapling, which you ideally should) talks to when checking OCSP.
To generate those responses they need a way (hey, a database) to get the set of all certificates which have not yet expired and whether those certificates are revoked or not.
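A heavily simplified sketch of that cycle; the db/hsm/cdn clients and their methods here are hypothetical placeholders, not Boulder's actual API:

```python
import time

REFRESH_INTERVAL = 4 * 3600   # regenerate well before previously published responses expire

def ocsp_refresh_loop(db, hsm, cdn):
    """db, hsm and cdn are assumed pre-configured clients (hypothetical interfaces)."""
    while True:
        # 1. The database supplies every not-yet-expired certificate plus its revocation
        #    status; this query is part of the read load on the MariaDB cluster.
        for cert in db.fetch_unexpired_certs():                     # hypothetical helper
            status = "revoked" if cert.revoked else "good"
            # 2. An HSM-backed signer produces a short-lived OCSP response for that status.
            response = hsm.sign_ocsp_response(cert.serial, status)  # hypothetical helper
            # 3. The signed blob is handed to the CDN; clients never talk to the CA directly.
            cdn.upload(cert.serial, response)                       # hypothetical helper
        time.sleep(REFRESH_INTERVAL)
```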
My thoughts exactly ... They are creating and storing text. Only thing I can think is that they don't actually need the storage, but just want the lowest possible latency by having a large number of drives.
Interesting that they decided to put in PCIe 3.0 NVMe SSDs instead of PCIe 4.0?
Imagine having 24x Intel Optane [1]. PCIe 5.0 is actually just around the corner. I would imagine that next time Let's Encrypt could upgrade again and continue to serve from a single DB machine.
Dirty secret when it comes to current PCIe 4.0 drives: they have optimised sequential read/write speed so much that they have forgotten about random read/write speed (which is the main driver for databases).
I'm curious why they didn't go with the larger 64 core Epyc. I mean it's double the cost, but I suspect that the huge amount of NVMe SSDs is by far the largest part of the cost anyway. And it seems like CPU was the previous bottleneck as it was at 90%.
We didn't go with the 64-core chips because they have significantly lower clock speeds.
Dual 32-core chips give us plenty of cores while keeping clocks higher for single-threaded performance.
You are correct that the price of the CPUs is almost irrelevant to the overall cost of a system with this much memory and storage. We were picking the ideal CPU, not selecting on CPU price.
Thanks for the answer. I would have guessed that the higher core count outweighs the lower frequency for database usage, but obviously I don't know the details. I think the 90% CPU usage graph just made me nervous enough to want the biggest possible CPU in there.
This is a very high-level overview, and ideally I would have liked to see more application-level profiling, e.g. where time is being spent (be it on CPU or I/O) within the DB, rather than high-level system stats. For example, the following:
> CPU usage (from /proc/stat) averaged over 90%
Leaves me wondering exactly which metric from /proc/stat they are referring to. It's presumably user time, but I just dislike attempts to distill systems performance into a few graph comparisons. In reality, the realized performance of a system is often better described by a narrative explaining what bottlenecks the system.
Unless I misunderstood something, it seems they have a single primary that handles read+write and multiple read replicas for it.
It shouldn't be too difficult given the current use of MariaDB to start using something like Galera to create a multi-master cluster and improve redundancy of the service, unless there are some non-obvious reasons why they wouldn't be doing this.
I think I also see redundant PSUs, would be neat to know if they're connected to different PDUs and if the networking is also redundant.
That's still a very common pattern if you need maximum performance, and can tolerate small periods of downtime. When designing systems, you have to accept some drawbacks. You can forgo a clustered database if you have a strong on call schedule, and redundancy built in to other parts of your infrastructure.
Galera is great, but you lose some functionality with transactions and locking that could be a deal breaker. And up until MySQL 8, there were some fairly significant barriers to automation and clustering that could be a turn off for some people.
Multi-master hardly comes for free in terms of complexity or performance; you're at the mercy of latency. Either host the second master in the same building, in which case the redundancy is an illusion, or host it somewhere else, in which case watch your write rate tank.
Asynchronous streaming to a truly redundant second site often makes more sense.
>> We currently use MariaDB, with the InnoDB database engine.
It is kind of funny how long InnoDB was the most reliable storage engine. I am not sure if MyISAM is still trying to catch up, it used to be much worse than InnoDB. With the emergence of RocksDB there are multiple options today.
The only thing I ever used MyISAM tables for was for storing blobs and full-text search on documents. If your data is mostly read only then it's a decent option out of the box. But if you do even mildly frequent updates then you'll quickly run into problems with the table level locking instead of the row level locking offered by InnoDB
What form factor are those NVMe drives, and how are they connected? I see cables, so I'm assuming they're not all plugged straight into their own PCIe slot. Are there a bunch of M.2 headers on the motherboard?
Let's assume, for the sake of argument, that Let's Encrypt is a malicious actor. Can they easily compromise the security of the websites using their certificates?
Yes, and not just the websites using their certificates. As a certificate authority, they can create certificates for arbitrary domains. There are, however, a few countermeasures against illegitimate certificates, such as certificate pinning and certificate transparency.
Yes, but they'd better not get caught. It's a trust-based model. It's actually Internet stack developers/packagers (anything from protocol implementations in OSes and libraries to browsers and devices) that trust Let's Encrypt, among other certificate authorities.
Then it would be 90 ms -> 81 ms, not 90 ms -> 9 ms. The way I see it, at least. With proper decimation, 90% of what was there remains. ("removal of a tenth", as wikipedia puts it).
So they've completely ignored the lesson Google taught us all 20+ years ago -- lots of cheap servers scales better than an ever more expensive big iron SPOF setup -- and which we're now all using to scale the internet. It's unbelievable to see a design like this in 2021.
So scaling up instead of scaling out. I’m not sure if it’s a viable strategy long term, at the same time we probably don’t want a single CA to handle too many certificates?
Scaling up means each query is faster (3x in this particular case). Scaling out means they can support more clients/domains (more DB shards, more web servers, more concurrency, etc).
These are two distinct axes that are not incompatible with each other.
I'm not aware of any other CA giving out free certificates to anyone. I know that some other providers/hosts will do free certificates, but only to their users (last time I checked).
I'm guessing someone out there's thinking: Why aren't they hosting in the cloud? The cloud being either Amazon or Azure. Surely nothing else exists. Is it really possible to host your own PHYSICAL machine? Does that count as the cloud?!
First, this made me giggle because I run into that attitude all the time. "You're hosting things on a SERVER? Why would anyone do THAT? Heck, you should be putting everything in serverless and avoiding even the vague possibility that you would have to touch anything so degrading and low-class as an operating system. Systems administration? Who does that?"
In all seriousness, however, the decision (likely) has very little to do with that. They're most likely not hosting in the cloud because the current CA/Browser Forum rules around the operation of public CAs effectively don't permit cloud hosting. That's a work in progress, but for the time being, the actual CA infrastructure can't be hosted in the cloud due to security and auditability requirements.
For a service like letsencrypt, the independence factor is also a major reason for self hosting.
I can foresee letsencrypt in the future going as far as building their own cloud (on their own physical infrastructure), but speaking as a letsencrypt user of their free certificate program, I would lose respect and interest in their service if they went with an AWS or GCP or Azure approach.
The independence from other major players (and the ability of their team to change and move everything about their service, as needed) is one of the reasons I use letsencrypt.
"one of" being the key point here. Let's encrypt has a huge number of sponsors (AWS being only 1 of 9 even if you only count the "platinum level" sponsors), which should allow them to maintain their independence.
They are not hosting their database in a cloud like Amazon or Azure because no cloud provider offers such high performance at a comparable price. Actually, I'm not even sure you can get a cloud VM with that much I/O, even if you don't mind the pricing.
They are a public CA, and they must undergo a third-party compliance audit to operate. The conditions are such that you cannot really pass if your infrastructure is in any of those public clouds.