Sure, but there are some fundamentals about latency that any programmer should know [0] (absolute values outdated, but still useful as relative comparisons), like “network calls are multiple orders of magnitude slower than IPC.”
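To make the relative gap concrete, here's a minimal sketch (my own, not from the article) that times an in-process call against a loopback HTTP round trip; any real network hop only widens the difference further, and a cross-region call adds tens of milliseconds on top.

```go
// Rough illustration of the orders-of-magnitude gap those latency tables
// describe: an in-process call vs. an HTTP round trip on loopback.
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"time"
)

func work(n int) int { return n * 2 } // stand-in for an in-process call

func main() {
	// In-process call: typically nanoseconds.
	start := time.Now()
	for i := 0; i < 1_000_000; i++ {
		_ = work(i)
	}
	fmt.Println("in-process call:    ", time.Since(start)/1_000_000)

	// Loopback HTTP call: typically tens to hundreds of microseconds,
	// before any real network distance is involved at all.
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		io.WriteString(w, "ok")
	}))
	defer srv.Close()

	start = time.Now()
	const reqs = 100
	for i := 0; i < reqs; i++ {
		resp, err := http.Get(srv.URL)
		if err != nil {
			panic(err)
		}
		io.Copy(io.Discard, resp.Body)
		resp.Body.Close()
	}
	fmt.Println("loopback HTTP call: ", time.Since(start)/reqs)
}
```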
I’m assuming you’re an employee of the company based on your comments, so please don’t take this poorly - I applaud any and all public efforts to bring back sanity to modern architecture, especially with objective metrics.
And yeah, you're right, in hindsight it was a terrible idea to begin with.
I thought it could work but didn’t benchmark it enough and didn’t plan enough. It all looked great in early POCs and all of these issues cropped up as we built it
You don't need experience, and there isn't really a lot to know about "distributed systems" in this case; it's basic CS knowledge about networks, latency, and what "serverless" actually is. You can read about it.
To be honest, it reads to me like people who don't understand the problem they're solving and haven't acquired the necessary knowledge to solve it (either by learning it themselves or by asking/hiring people who have it), and seeing such an amateurish mistake doesn't inspire confidence for the future.
You should either hire people who know what they are doing or upgrade your knowledge about the systems you are using before making decisions to use them.
Sometimes I see a post about sorting algorithms online. Some people seem to benefit from reading about these things, but often, I find there isn't much new information for me. That's OK, because I know somebody somewhere benefits from knowing this.
It is your decision to make this a circlejerk of musings about how the company must be run by amateurs. Whatever crusade you're fighting in vividly criticising them is not valuable at all. People need to learn and share so we can all improve; stop distracting from that point.
What did your internal discussion conclude for the question "Why did we not take a step back earlier and think, why are we doing it this way?"
I'm genuinely curious, because this is not singling out your team or org; this is a very common occurrence among modern engineering teams, and I've often found myself on the losing end of such arguments. So I am all ears to hear at least one such team tell us what goes on in their minds when they make terrible architecture decisions, and whether they learned anything philosophical that would prevent a repeat.
Oh we had it coming for quite some time and knew we would need to rebuild it, we just didn’t have the capacity to do it unfortunately.
I was working on it on and off, moving one endpoint at a time, but it was very slow until we hired someone who was able to focus on it.
It didn’t feel good at all. We knew the product had massive flaws due to the latency but couldn’t address it quickly. Especially cause we had to build more workarounds as time went on. Workarounds we knew would be made redundant by the reimplementation.
I think we had that “wtf are we doing here” discussion pretty early, but we didn’t act on it in the beginning; instead we tried different approaches to make it work within the serverless constraints, cause that’s what we knew well.
I have had CTOs (two in my career) tell me we had to use our AWS credits since they were going to expire worthless. Both experiences were at vc-backed startups.
I doubt they literally said “perfect for low latency APIs,” but their messaging is definitely trying to convince you that they’re fast globally; just look at the workers.cloudflare.com page.
Have you done new benchmarks since Cloudflare announced their latest round of performance improvements for Workers?
Just curious if this workload also saw some of the same improvements (on a quick read it seems like you could have been hitting the routing problem CF mentions)
Really great writeup. The charts tell the story beautifully, and the latency gains are surely a win for your company and customers. I always wonder about the tradeoffs. Is there a measurable latency difference for your non-colocated customers? What does maintenance look like for your Go servers? I assume that your Cloudflare costs dropped?
It’s faster for non-colocated customers too, weirdly.
I think it’s because connections can be reused more often. Cloudflare Workers are really prone to doing a lot of TLS handshakes cause they spin up new ones constantly.
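For what it's worth, here's a minimal sketch (assumed names, not their actual code) of why a long-lived Go server avoids those handshakes: a single shared http.Client keeps a pool of idle TLS connections across requests, which a short-lived serverless instance loses every time it spins up.

```go
// Shared, long-lived HTTP client: its Transport keeps idle connections
// around so repeat calls to the same upstream skip DNS + TCP + TLS setup.
package main

import (
	"io"
	"net/http"
	"time"
)

var upstream = &http.Client{
	Transport: &http.Transport{
		MaxIdleConns:        100,
		MaxIdleConnsPerHost: 10,
		IdleConnTimeout:     90 * time.Second,
	},
	Timeout: 5 * time.Second,
}

func callUpstream(url string) error {
	resp, err := upstream.Get(url)
	if err != nil {
		return err
	}
	// Draining and closing the body is what lets the connection go back
	// into the idle pool instead of being torn down.
	defer resp.Body.Close()
	_, err = io.Copy(io.Discard, resp.Body)
	return err
}
```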
Right now we’re just using AWS Fargate for the Go servers, so there really isn’t much maintenance at all. We’ll be moving that into EKS soon though, cause we are starting to add more stuff and need k8s anyways.