This is an awesome experiment and write up. I really appreciate the reproducibility.
I would like to see how moving to a database that scales write throughput with replicas would behave, namely FoundationDB. I think this will require more than an intermediary like kine to be efficient: as the author illustrates, the apiserver does a fair bit of its own watching and state keeping. I also think there's benefit, at least for blast radius, in sharding the server by API group or namespace.
I think years ago this would have been a non-starter with the community, but given AWS has replaced etcd (or at least aspects of it) with their internal log service for their large cluster offering, I bet there's some appetite for making this interchangeable and bringing an open source solution to market.
I share the author's viewpoint that for modern cloud-based deployments, you're probably best avoiding it and relying on VMs being stable and recoverable. I think reliability does matter if you want to actually realize the "Borg" value and run it on bare metal across a serious fleet. I haven't found the business justification to work on that, though!
To be honest, I was building it with the purpose of matching etcd at scale, but making FoundationDB a multi-tenant data store.
But with the recent craze around scalability, I'll be investing time into understanding how far FoundationDB can be pushed as a K8s data store. Stay tuned.
It's definitely a different use case, but given they haven't had to tap into their follower replicas for scale, it must be pretty efficient and lightweight. I suspect not having ACLs helps. They also cite a 2MB minimum size, so they're not expecting exabytes of tiny objects.
I wonder if a major difference is listing a prefix in object storage vs performing recursive listings in a file system?
Even in S3, performing very large lists over a prefix is slow, and small files will always be slow to work with, so regular compaction and caching file names is usually worthwhile.
I've been messing with NVMe over TCP at home lately and it's pretty awesome. You can scoop up the last generation of 10GbE/40GbE networking on eBay for cheap and build your own fast disaggregated storage on upstream Linux. The kernel-based implementation saves you some context switching over other network file systems, and you can (probably) pay to play for on-NIC implementations (especially as they're getting smarter).
It seems like these solutions don't have a strong authentication/encryption-in-transit story. Are vendors building this into proprietary products or is this only being used on trusted networks? I think it'd be solid technology to leverage for container storage.
I just use iSCSI at home, but over Mellanox's RoCE, which performs pretty well.
One thing I’m noticing is that most of these storage protocols do, in fact, assume converged Ethernet; that is, zero packet loss and proper flow control.
I haven't experimented with it yet, but I expect that over TCP things degrade more gracefully. It seems earlier iterations of storage over networking didn't want to pay the overhead of TCP and lost out on the general-purpose benefits it brings. IIRC some RoCE iterations aren't routable, for example. In theory, you could expose your NVMe over TCP device over the internet.
It seems to me applications taking advantage of NVMe are focused on building out deep queues of operations, which may smooth out issues introduced by the network. But the only way to know is to benchmark.
It certainly isn't faster or more reliable for a single node, but this is just homelab stuff. Nothing in mine is particularly necessary. I think it's interesting because it's now accessible and receiving support from multiple vendors.
At some scale, it's nice to separate the two. You don't care much about where the disks live vs. where your compute is running. You can evict work from a node, have it reschedule, and not have to worry about replicating the data to another machine with excess capacity. Though I'm no authority on this topic.
The relation to the Go toolchain right now is that your GOPATH is used to find Go packages when you import them. (It uses the go tool to build the package as a plugin for loading.)
You certainly could put your Neugram files in the same git repository as your Go code, and then "go get" would get them.
I think there is something to be said for something like:
import "github.com/crawshaw/foo/mypkg.ng"
looking for the .ng file in your GOPATH. I need to think about that a bit more.
(Note that a Neugram package is limited to a single file, unlike a Go package. This is for a couple of practical reasons, notably init order, and one philosophical reason: Neugram packages shouldn't get as big as Go packages.)
The value of the Makefile is in automating repetitive dev tasks. Consumers of a Go CLI should use `go get`, but a Makefile makes life easier for contributors who run tests, cut releases, etc.
The author's example is clean and simple. Here's a not-so-clean-and-simple example where we needed to do more complex stuff, like building for all target platforms [1].
noexcept is messier than const because it's not straightforward to assess whether code you rely on should throw or not, since that's an implementation detail rather than a contract around data ownership. This makes maintenance hard as implementations change.
It'd be nice to see compilers infer the exception behavior themselves and apply these optimizations when appropriate.
A lot of large C++ shops re-implement much of the standard library, often to add niche features that don't make sense as a standard. I believe replacing the allocator in a lot of the standard data structures with a non-throwing one will get you pretty close to non-throwing.
Was the connector not an option for you or have things gotten better? I commute in slightly off hours (7am and 4pm) and it consistently takes ~35 minutes door to door.