
We have a distributor we work with - just because it makes import/export a lot easier. But we get to interface directly with Supermicro for the technical/design stuff, and they're super awesome. If you're looking in the US, reach out to their eStore - really great fuss-free turnaround and all direct.

the good news on this is that we've got a tonne of deep-dive material on networking and whitebox switches we cut from this post. We'll definitely be talking more about this soon (also cos' BGP is cool).

correct; I think the first version of our tool sprang up in the space of a couple of weekends. It wasn't planned; my colleague Pierre, who wrote it, just had a lot of fun building it.

All valid points - and our ideas for Gen 2 sound directionally similar - but those are at the crayon-drawing stage.

When we started, we didn't have much of an idea of what the rack needed to look like. So we chose a combination of things we thought would let us pull this off. We're mostly software and systems folks, and there's a dearth of information out there on what to do. Vendors tend to gravitate towards selling BGP+EVPN+VXLAN or whatever "enterprise" reference designs; so we kinda YOLO'ed Gen 1. We decided to spend extra money wherever it got us to a working setup sooner. When the clock is in cloud spend, there's uh... lots of opportunity cost :D.

A lot of the chipset and switch choices were bets, and we had to pick and choose what we gambled on - and what we could get our hands on. The main bets this round were eBGP to the hosts with BGP unnumbered, and SONiC switches - together these let us do a lot of networking with our existing IPv6/WireGuard/eBPF overlay, and give us a Debian-based switch OS + FRR (so fewer things to learn). And of course figuring out how to operationalise the install process and get stuff running on the hardware as soon as possible.
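
To make the "eBGP to the hosts with BGP unnumbered" bit more concrete, here's a rough sketch of what a minimal per-host FRR config could look like, rendered from a tiny Python helper - illustrative only; the ASN, router ID and interface name are placeholders, not our actual config:

    # Illustrative only: render a minimal frr.conf for eBGP unnumbered on a host.
    # The ASN, router ID and uplink interface are made-up placeholders.
    FRR_TEMPLATE = """\
    router bgp {asn}
     bgp router-id {router_id}
     neighbor {uplink} interface remote-as external
     address-family ipv4 unicast
      redistribute connected
     exit-address-family
    """

    def render_frr_conf(asn: int, router_id: str, uplink: str) -> str:
        return FRR_TEMPLATE.format(asn=asn, router_id=router_id, uplink=uplink)

    if __name__ == "__main__":
        # e.g. each host peers with its ToR over the fabric-facing interface
        print(render_frr_conf(asn=65001, router_id="10.0.0.11", uplink="eth1"))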

Now that we've got a working design, we'll start iterating a bit more on the hardware choices and network design. I'd love for us to write about it when we get through it. Plus I think we owe the internet a rant on networking in general.

Edit: Also, we don't use UniFi Pro / Ubiquiti gear anywhere?


ah, that's my bad - I wrote this in Dec but we only published in Jan. Obviously missed updating that.

Timeline-wise:

- We decided to go for it and spend the $$$ in Oct '23

- Convos/planning started ~Jan '24

- Picked the vendors we wanted by ~Feb/Mar '24

- Lead times etc. meant everything was ready for us to go fit the first gear, mostly by ourselves, at the start of May (that's the 5 mo)

- We did the "proper" re-install around June, followed closely by the second site in ~Sep, around when we started letting our users on it as an open beta

- Sep-Dec we just doubled down on refining software/automation and process while building out successive installs

Lead times can be mind-numbing. We have certain switches from Arista that have a 3-6 mo lead time. Servers are built to order, so again 2+ months depending on stock. And obviously holidays mean a lot of stuff shuts down around December.

Sometimes you can swap stuff around to get better lead-times, but then the operational complexity explodes because you have this slightly different component at this one site.

I used to be an EEE, and I thought supply chain there was bad. But with DCs I think it's sometimes worse, because you don't directly control some parts of your BoM/supply chain (especially with build-to-order servers).


oh yes, we want to; I even priced a couple out. Most of the SKUs I found were pretty old, and we couldn't find anything compelling enough to risk deploying at the scale we wanted. It's on the wishlist, and if the right hardware comes along, we'll rack it up even as a bet. We maintain Nixpacks (https://nixpacks.com/docs/getting-started), so for most of our users we could rebuild most of their apps for ARM seamlessly - in fact we mostly develop our build systems on ARM (because MacBooks). One day.

> We maintain Nixpacks

I _knew_ Railway sounded familiar.

Out of curiosity: is nix used to deploy the servers?


Not ATM. We use it in a lot of our stack, so we will likely pull it in in the future.

Got it. Especially interested to see how you set up PXE. I've seen a few write-ups out there but never got around to doing it in my lab.

Looking forward to more blogposts!


we evaluated a lot of commercial and OSS offerings before we decided to go build it ourselves - we still have a deploy of NetBox somewhere. But our custom tool (Railyard) works so well because it integrates deeply into our full software, hardware and orchestration stack. The problem with the OSS stuff is that it's almost too generic - you end up shaping the problem to fit its data model instead of solving the problem. We're likely going to fold our tool into Railway itself eventually - want to go on-prem? Button-click hardware design, commissioning, deploy and devex. Sorta like what Oxide is doing, but approaching the problem from the opposite side.

> It would be nice to have a lot more detail

I'm going to save this for when I'm asked to cut the three paras on power circuit types.

Re: standardising layout at the rack level - we do now! We only figured this out after site #2. It makes everything so much easier to verify. And yeah, validation is hard - we've been doing it manually thus far; we want to play around with scraping LLDP data, but our switch software stack has a bug :/. It's an evolving process: the more we work with different contractors, the more edge cases we unearth and account for.

The biggest improvement is that we've built an internal DCIM that templates a rack design and exports an interactive "cabling explorer" for the site techs - including detailed annotated diagrams of equipment showing port names, etc... The elevation screenshot in the post is part of that tool.
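
For flavour, the kind of LLDP scraping we have in mind could look roughly like this - purely illustrative, assuming lldpd's lldpctl with JSON output on each box (the exact JSON shape varies between lldpd versions, and the expected-cabling map here is made up):

    # Illustrative only: diff what LLDP actually sees against the intended cabling.
    # Assumes lldpd is installed and `lldpctl -f json` works; JSON layout may vary.
    import json
    import subprocess

    EXPECTED = {"eth0": "tor-a", "eth1": "tor-b"}  # hypothetical port -> switch map

    def observed_neighbors() -> dict:
        out = subprocess.run(["lldpctl", "-f", "json"],
                             capture_output=True, text=True, check=True)
        data = json.loads(out.stdout).get("lldp", {}).get("interface", [])
        entries = data if isinstance(data, list) else [data]
        seen = {}
        for entry in entries:
            for port, info in entry.items():
                # lldpctl keys the chassis block by the neighbour's system name
                seen[port] = next(iter(info.get("chassis", {})), "unknown")
        return seen

    seen = observed_neighbors()
    for port, want in EXPECTED.items():
        got = seen.get(port)
        print(port, "OK" if got == want else f"MISMATCH (expected {want}, saw {got})")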

> What does your metal->boot stack look like?

We've hacked together something on top of https://github.com/danderson/netboot/tree/main/pixiecore that serves a Debian netboot + preseed file. We have some custom Temporal workers that connect to the Redfish APIs on the BMCs to puppeteer the contraption. Then a custom host agent provisions QEMU VMs and advertises assigned IPs via BGP (using FRR) from the host.
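
For a flavour of the Redfish side, a minimal hypothetical sketch (the BMC address, credentials and system ID are placeholders; in reality this logic lives inside our Temporal workers, not a standalone script):

    # Illustrative only: ask a BMC, via its Redfish API, to PXE-boot a machine
    # once and then power-cycle it. Host, credentials and system ID are made up.
    import requests

    BMC = "https://10.0.0.50"
    AUTH = ("admin", "password")
    SYSTEM = f"{BMC}/redfish/v1/Systems/1"

    def pxe_boot_once() -> None:
        # Override the next boot device to PXE (standard ComputerSystem schema)
        requests.patch(
            SYSTEM,
            json={"Boot": {"BootSourceOverrideEnabled": "Once",
                           "BootSourceOverrideTarget": "Pxe"}},
            auth=AUTH, verify=False,
        ).raise_for_status()
        # Then force a reset so the machine actually netboots
        requests.post(
            f"{SYSTEM}/Actions/ComputerSystem.Reset",
            json={"ResetType": "ForceRestart"},
            auth=AUTH, verify=False,
        ).raise_for_status()

    if __name__ == "__main__":
        pxe_boot_once()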

Re: new DCs for failure scenarios, yeah we've already blown breakers etc... testing stuff (that's how we figured out our phase balancing was off). Went in with a thermal camera on another. A site in AMS is coming up next week and the goal for that is to see how far we can push a fully loaded switch fabric.


Wonderful!

The edge cases are the gold, btw - collect the whole set and keep them in a human- and machine-readable format.

I'd also go through and, using a color-coded set of cables, insert bad cables (one at a time at first) while the system is doing an aggressive all-to-all workload, and see how quickly you can identify faults.

It is the gray failures that will bring the system down - often multiple at once, as a single failure will go undetected for months and then finally tip over an inflection point at a later time.

Are your workloads ephemeral and/or do they live-migrate? Or will physical hosts have long uptimes? It is nice to be able to re-baseline the hardware before and after host kernel upgrades so you can detect any anomalies.

You would be surprised how large a systemic performance degradation major cloud providers have seen build up over months because "all machines are the same" - high precision but low absolute accuracy. It is nice to run the same benchmarks on bare metal and then again under virtualization.

I am sure you know, but you are running a multivariate longitudinal experiment, science the shit out of it.


Long-running hosts at the moment, but we can drain most workloads off a specific host/rack if required and reschedule them pretty fast. We have the advantage of a custom scheduler/orchestrator we've been working on for years, so we have a lot more control at that layer than we would with Kube or Nomad.

Re: Live Migration - we're working on adding Live Migration support to our orchestrator atm. We aim to have it running this quarter. That'll make things super seamless.

Re: kernels - we've already seen some perf improvements somewhere between 6.0 and 6.5 (I forget the exact reason/version), but it was some fix specific to the Sapphire Rapids CPUs we had. I wish we had more time to science on it; it's really fun playing with all the knobs and benchmarking stuff. Some of the telemetry on the new CPUs is also crazy - there's stuff like Intel PCM that can pull super fine-grained telemetry direct from the CPU/chipset: https://github.com/intel/pcm. We've only used it to confirm that we got NUMA affinity right so far - nothing crazy.


Last thing.

You will need a way to coordinate LM with users, because some workloads are sensitive to LM blackouts. Not many are, but the ones that are are exactly the kinds of things that customers will just leave over.

If you are draining a host, make sure new VMs land on hosts that can be guaranteed to be maintenance-free for the next x days. This allows customers to restart their workloads on their own schedule with a guarantee that they won't be impacted. It also encourages good hygiene.

Allow customers to trigger migration.

Charge extra for a long running maintenance free host.

It is good you are hooked into the PCM already. You will experience accidentally antagonistic workloads and the PCM will really help debug those issues.

If I were building a DC, I'd put as many NICs into a host as possible and use SR-IOV to pass the NICs into the guests. The switches should be sized to allow full speed on all NICs. I know it sounds crazy, but if you design for a typical CRUD serving tree, you are saving a buck while making your software problem 100x harder.
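
A rough sketch of the SR-IOV part, since it's less code than it sounds - the interface name and VF count are made up; the sysfs knob is the standard kernel interface:

    # Illustrative only: carve a NIC into SR-IOV virtual functions via sysfs.
    # The VFs can then be handed to guests (e.g. as vfio-pci hostdevs under QEMU).
    from pathlib import Path

    def create_vfs(iface: str, num_vfs: int) -> None:
        knob = Path(f"/sys/class/net/{iface}/device/sriov_numvfs")
        knob.write_text("0")            # many drivers require resetting to 0 first
        knob.write_text(str(num_vfs))   # then set the desired VF count

    if __name__ == "__main__":
        create_vfs("eth2", 8)  # placeholder interface and VF count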

Everything should have enough headroom so it never hits a knee of a contention curve.


We didn't find many good up-to-date resources online on the hardware side of things - kinda why we wanted to write about it. The networking aspect was the most mystical - I highly recommend "BGP in the Data Center" by Dinesh Dutt on that (I think it's available for free via NVIDIA). Our design is heavily influenced by the ideas discussed there.

the title page says 2017 if that matters to anyone: https://docs.jetstream-cloud.org/attachments/bgp-in-the-data...

What was the background of your team going into this project? Did you hire specialists for it (whether full time or consultants)?

We talked to a few - I think they're called MSPs? We weren't super impressed, so we decided to YOLO it. There are probably great outfits out there, but it's hard to find them through the noise. We're mostly software and systems folks, but Railway is an infrastructure company, so we need to own stuff down to the cage nut - we owe it to our users. All engineering, project management and procurement is in-house.

We're lucky to have a few great distributors/manufacturers who help us pick the right gear. But we learnt a lot.

We've found a lot of value in getting a broker in to source our transit though.

My personal (and potentially misguided) hot take is that most of the bare-metal world is stuck in the early 2000s, and the only companies doing anything interesting here are the likes of AWS, Google and Meta. So the only way to innovate is to stumble around, escape the norms and experiment.


Did your investors give you any pushback or were they mostly supportive?

We're blessed with some kickass investors. They gave us just the right level of scrutiny. We were super clear about why we wanted to do this, we did it, and then they invested more money shortly after the first workloads started running on metal.

If you're looking for great partners who actually have the gall to back innovation, you'd be hard-pressed to do better than Redpoint (shoutout Erica and Jordan!).


We built some internal tooling to help manage the hosts. Once a host is onboarded onto it, it's a few button clicks on an internal dashboard to provision a QEMU VM. We made a custom Ansible inventory plugin so we can manage these VMs the same way we manage machines on GCP.

The host runs a custom daemon that programs FRR (an OSS routing stack) so that it advertises addresses assigned to a VM to the rest of the cluster via BGP. So zero config of the network switches, etc. is required after initial setup.
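
As a rough illustration of that idea (not our actual daemon - the ASN and the use of vtysh here are assumptions), injecting a VM's /32 into BGP via FRR could look something like:

    # Illustrative only: tell FRR, via vtysh, to originate a /32 for a newly
    # provisioned VM. The ASN is a placeholder, and a matching route must exist
    # in the kernel/RIB for the `network` statement to actually be advertised.
    import subprocess

    def advertise_vm_ip(vm_ip: str, asn: int = 65001) -> None:
        cmds = [
            "configure terminal",
            f"router bgp {asn}",
            "address-family ipv4 unicast",
            f"network {vm_ip}/32",
        ]
        args = ["vtysh"]
        for c in cmds:
            args += ["-c", c]
        subprocess.run(args, check=True)

    if __name__ == "__main__":
        advertise_vm_ip("10.64.3.25")  # placeholder VM address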

We'll blog about this system at some point in the coming months.

