This is an awesome experiment and write up. I really appreciate the reproducibility.
I would like to see how moving to a database that scales write throughput with replicas would behave, namely FoundationDB. I think this will require more than an intermediary like kine to be efficient: as the author illustrates, the apiserver does a fair bit of its own watching and state keeping. I also think there's benefit, at least for blast radius, in sharding the server by API group or namespace.
I think years ago this would have been a non-starter with the community, but given AWS has replaced etcd (or at least aspects of it) with their internal log service for their large cluster offering, I bet there's some appetite for making this interchangeable and bringing an open source solution to market.
I share the author's viewpoint that for modern cloud-based deployments, you're probably best avoiding it and relying on VMs being stable and recoverable. I think reliability does matter if you want to actually realize the "Borg" value and run it on bare metal across a serious fleet. I haven't found the business justification to work on that, though!
To be honest, I was building it with the purpose of matching etcd at scale, but making FoundationDB a multi-tenant data store.
But with the recent craze around scalability, I'll be investing time into understanding how far FoundationDB can be pushed as a K8s data store. Stay tuned.
It's definitely a different use case, but given they haven't had to tap into their follower replicas for scale, it must be pretty efficient and lightweight. I suspect not having ACLs helps. They also cite a 2MB minimum size, so they're not expecting exabytes of tiny objects.
I wonder if a major difference is listing a prefix in object storage vs performing recursive listings in a file system?
Even in S3, performing very large lists over a prefix is slow, and small files will always be slow to work with, so regular compaction and caching file names is usually worthwhile.
I've been messing with NVMe over TCP at home lately and it's pretty awesome. You can scoop up the last generation of 10GbE/40GbE networking on eBay for cheap and build your own fast disaggregated storage on upstream Linux. The kernel-based implementation saves you some context switching over other network file systems, and you can (probably) pay to play for on-NIC implementations (especially as they're getting smarter).
It seems like these solutions don't have a strong authentication/encryption-in-transit story. Are vendors building this into proprietary products or is this only being used on trusted networks? I think it'd be solid technology to leverage for container storage.
I just use iSCSI at home, but over Mellanox's RoCE, which performs pretty well.
One thing I’m noticing is that most of these storage protocols do, in fact, assume converged Ethernet; that is, zero packet loss and proper flow control.
I haven't experimented with it yet, but I expect that over TCP things degrade more gracefully. It seems earlier iterations of storage over networking didn't want to pay the overhead of TCP and lost out on the general-purpose benefits it brings. IIRC some RoCE iterations aren't routable, for example. In theory, you could expose your NVMe over TCP device over the internet.
It seems to me applications taking advantage of NVMe are focused on building out deep queues of operations, which may smooth out issues introduced by the network. But the only way to know is to benchmark.
It certainly isn't faster or more reliable for a single node, but this is just homelab stuff. Nothing in mine is particularly necessary. I think it's interesting because it's now accessible and receiving support from multiple vendors.
At some scale, it's nice to separate the two. You don't care much about where the disks live vs. where your compute is running. You can evict work from a node, have it reschedule, and not have to worry about replicating the data to another machine with excess capacity. Though I'm no authority on this topic.
The relation to the Go toolchain right now is that your GOPATH is used to find Go packages when you import them. (It uses the go tool to build the package as a plugin for loading.)
You certainly could put your Neugram files in the same git repository as your Go code, and then "go get" would get them.
I think there is something to be said for something like:
import "github.com/crawshaw/foo/mypkg.ng"
looking for the .ng file in your GOPATH. I need to think about that a bit more.
(Note that a Neugram package is limited to a single file, unlike a Go package. This is for a couple of practical reasons, notably init order, and one philosophical reason: Neugram packages shouldn't get as big as Go packages.)
The value of the Makefile is in automating repetitive dev tasks. Consumers of a Go CLI should use `go get`, but a Makefile makes life easier for contributors who run tests, cut releases, etc.
The author's example is clean and simple. Here's a not-so-clean-and-simple example where we needed to do more complex stuff, like building for all target platforms [1].
noexcept is messier than const because it's not straightforward to assess whether code you rely on should throw or not, since that's an implementation detail rather than a contract around data ownership. This makes maintenance hard as implementations change.
It'd be nice to see compilers infer the exception behavior themselves and apply these optimizations when appropriate.
A lot of large C++ shops re-implement much of the standard library, often to add niche features that don't make sense as a standard. I believe replacing the allocator in a lot of the standard data structures with a non-throwing one will get you pretty close to non-throwing.
Was the connector not an option for you or have things gotten better? I commute in slightly off hours (7am and 4pm) and it consistently takes ~35 minutes door to door.