
The $10m Engineering Problem - fullung
https://segment.com/blog/the-10m-engineering-problem/
======
boulos
Disclosure: I work on Google Cloud.

Awesome writeup! I’ve seen lots of customers do a similar “let the packets
spray” setup on both GCP and AWS. Interestingly, it’s one of the reasons I was
so excited for our “ILB as next hop” [1] feature. Routing to the Service
within the same Zone, _unless_ there’s a failure, at which point you want to
go elsewhere in the Region, is a common pattern. I’m excited to see where
Traffic Director and similar service mesh patterns lead in this space. Having
everyone roll this by hand seems needlessly redundant.
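
(For the curious, the hand-rolled version of that pattern tends to look
something like the Go sketch below. The Backend type and health field are
hypothetical stand-ins for whatever your service discovery returns.)

    // Hypothetical sketch: prefer healthy same-zone backends, spill over
    // to the rest of the region only when the local zone has none.
    package zoneaware

    type Backend struct {
        Addr    string
        Zone    string
        Healthy bool
    }

    func pickBackends(all []Backend, localZone string) []Backend {
        var local, regional []Backend
        for _, b := range all {
            if !b.Healthy {
                continue
            }
            regional = append(regional, b)
            if b.Zone == localZone {
                local = append(local, b)
            }
        }
        if len(local) > 0 {
            return local // steady state: stay in-zone
        }
        return regional // zonal failure: go elsewhere in the Region
    }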

As an aside, another personal surprise came when I got into an argument over
Zone-to-Zone pricing: GCE only charges for traffic one way, while AWS charges
for both send and receive. We should clearly do better marketing :).

[1] https://cloud.google.com/load-balancing/docs/internal/ilb-next-hop-overview

~~~
dilyevsky
GCP services always look great in the marketing booklets, but they tend to
have less-than-stellar reliability, and you gotta read the fine print. A
perfect example: the ILB only supports a max of 250 backends and is basically
unusable for multi-regional setups.

------
shrubble
340 VMs is, at most, maybe 2 racks of equipment.

So figure 4 racks, 2 racks in each of 2 locations. That's not even 500k in
equipment. Telecoms run 'tandem' and that is good enough even for 911
infrastructure.

10Gb of decent-quality internet at each location is another 2x $5k per month.
Power, space, etc. plus remote hands at $2k x 4 racks is another $8k per
month.

So ~$500k capex plus $18k per month. And how much are they paying AWS?

~~~
dilyevsky
Righto, and while you’re setting all of this up (which usually takes months),
your business is going to a competitor who just spun up a few nodes with a
couple lines of Terraform (or just autoscaled to meet demand). Also, the 10G
link is hilarious - you will get 30x that on GCP for this many cores.

------
thinkingkong
This is awesome. I'd love to see more blog posts tying business value to
engineering problems, finding ways to measure before/after outcomes, and then
sharing the engineering details. What a great post.

------
cle
> Then when a reader connects, instead of connecting directly to the
> nsqlookupd discovery service, the reader connects to a proxy. The proxy has
> two jobs. One is to cache lookup requests, but the other is to return only
> in-zone nsqd instances for zone-aware clients.

> Our forwarders that read from NSQ are then configured as one of these zone-
> aware clients. We run three copies of the service (one for each zone), and
> then have each send traffic only to the service in its zone.

Isn't this the default behavior of ELB/NLB to begin with? Why not just
configure the zone-aware clients to call zonal LBs instead of hosting your own
LB? Same with Consul. I'm not understanding what benefit Segment gets from
using Consul vs. calling the EC2 Metadata API to discover the AZ and then
calling the appropriate zonal LB endpoint...that's not hard to do and avoids
many extra dimensions of operational complexity.
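
For concreteness, a minimal Go sketch of that alternative (the per-zone LB
endpoint naming scheme here is made up):

    package main

    import (
        "fmt"
        "io"
        "net/http"
    )

    // availabilityZone asks the EC2 instance metadata service which AZ
    // we're in. Call it once at startup, not on a hot path.
    func availabilityZone() (string, error) {
        resp, err := http.Get("http://169.254.169.254/latest/meta-data/placement/availability-zone")
        if err != nil {
            return "", err
        }
        defer resp.Body.Close()
        az, err := io.ReadAll(resp.Body)
        if err != nil {
            return "", err
        }
        return string(az), nil
    }

    func main() {
        az, err := availabilityZone()
        if err != nil {
            panic(err)
        }
        // Hypothetical per-zone LB endpoint naming.
        endpoint := fmt.Sprintf("nsqd-%s.lb.internal.example.com", az)
        fmt.Println("pinning traffic to", endpoint)
    }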

It's also unclear to me how all this migration to intra-AZ routing affects
Segment's resilience to AZ outages.

~~~
mnutt
The EC2 Metadata API isn’t meant for high-throughput calls, so it’s possible
to hit rate limits even from moderate polling once you get enough nodes
involved.

~~~
manigandham
Why do you have to constantly poll it? You call it once on startup to discover
the zone the instance is running in.

------
ltbarcly3
Buying rack space at a colo costs money, but if you are spending millions of
dollars on AWS, you will likely end up spending only a few hundred thousand,
including a salaried sysadmin to manage the hardware.

This does mean increased management complexity, so you have to build out an
operations team. The total for salaries will be around 400-600k.

In the end you will have some setup costs and you will have to choose a subset
of the features AWS offers, but you'll save millions of dollars per year and
have much better performing hardware and much, much more flexibility.

AWS is extremely expensive.

~~~
boulos
Disclosure: I work on Google Cloud.

The blog post doesn’t spell it out directly, but one of their biggest costs
was networking between datacenters (Availability Zones in AWS). Most “buy a
rack at a colo” comparisons assume a single colo and a static fleet of
hardware.

If you wanted to compare apples-to-apples, you’d need to have (at least) three
nearby colos with enough capacity to handle one going down entirely at peak
load (“N+1”). Leased lines in a metro area aren’t actually all that expensive,
but like the compute, you also need to purchase that with failure in mind.

tl;dr: Maybe, but the analysis needs to assume the same(ish) reliability
outcome. Otherwise, they could have avoided lots of cost by just running in a
single Zone.

~~~
pm7
> If you wanted to compare apples-to-apples, you’d need to have (at least)
> three nearby colos with enough capacity to handle one going down entirely at
> peak load (“N+1”).

Not true if it's possible to fall back to the cloud. That way we can have both
high reliability and low cost (other than during an outage/maintenance of the
colocation site).

~~~
boulos
Hmm. I read the comment as saying “no cloud, because you’ll save so much by
just being on-prem”. And I think an “apples-to-apples” comparison requires an
N+1 setup including both compute and networking.

Hybrid could be many different setups, but before their “zonal affinity”
change it would actually be worse, right? (Egress over Direct Connect is 4x
higher than Zone to Zone, while “internet” egress is 8x). What are you
assuming for the balance of Compute and Networking across at least three
“sites”?

~~~
pm7
> Hmm. I read the comment as saying “no cloud, because you’ll save so much by
> just being on-prem”. And I think an “apples-to-apples” comparison requires
> an N+1 setup including both compute and networking.

That is a valid interpretation. I just wanted to say that if you need high
availability, it might be cheaper to have one colocation site and the cloud on
standby.

> Hybrid could be many different setups, but before their “zonal affinity”
> change it would actually be worse, right? (Egress over Direct Connect is 4x
> higher than Zone to Zone, while “internet” egress is 8x).

Yes, in/out traffic would be one of the more problematic points of such a
setup, but there should be some solutions available (BGP?).

> What are you assuming for the balance of Compute and Networking across at
> least three “sites”?

The least expensive setup should be zero compute in the cloud unless there is
an issue with the colocation site. Depending on the specific scenario, some
storage/databases would have replication to the cloud. I don't know how I
would set up networking in such a case.

~~~
pm7
> The least expensive setup should be zero compute in the cloud unless there
> is an issue with the colocation site.

One more thing: the cloud can be great for scaling up at peak demand without
buying servers that will idle most of the time. It's just that using only the
cloud might be much more costly, even if it is easier.

------
throwaway_bad
Always interesting to see the scale you have to hit before rewriting from one
language to another saves money (relative to engineering cost).

With node.js: 800 containers, with each container processing 250 messages per
second

With golang: 340 containers, with each container processing 650 messages per
second

Say each of those containers costs $0.02/hr; then that's on the order of
$100k/year saved!
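
Back-of-the-envelope, assuming the fleet runs 24/7 at that price:

    (800 - 340) containers x $0.02/hr x 8,760 hr/yr ≈ $80,600/year

which indeed rounds to order-of-$100k.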

~~~
kenhwang
Considering a typical HCoL junior dev costs about ~100k/yr, if you can have
one junior dev rewrite your entire codebase in a year, you'll break even on
cost after 2 years. Considering a senior dev costs 2-3x that amount per year,
as soon as you have one of those involved for an entire year (odds are, if
it's business-critical software, you will), your breakeven point comes out to
just under a decade in the worst case.

I think that just illustrates how risky rewrites are. Very few companies at
that scale can rewrite everything in that timeframe with so few resources.
Many companies don't even have codebases that will survive a decade.

~~~
caust1c
We considered a few options before rewriting it. I've got a draft blog post
about the process lying around, but haven't gotten around to getting it over
the finish line.

We definitely knew the risk going into it. Fortunately, it only took us 2
months to rewrite it. I think our strategy for the rewrite is directly
responsible for how quickly we got it done.

------
yowlingcat
This was a great engineering blog post in that it did a good job of
describing, in detail, a large overarching problem (excess inter-AZ
bandwidth), its impact on margin (20%), and concrete steps to measure both the
problem and the solution. This is exactly the type of communication we should
be able to use as an example of the outsize effect engineering can have on the
long-term value of a company -- if that margin increase can in some way be
turned into CAGR, these increased margins could double the company's valuation
in 4 years (in an ideal world, of course).

Very cool.

------
manigandham
It's better to use multiple regions instead of multiple zones in a single
region. The costs are very similar (and sometimes even the same), especially
with the ridiculous networking fees.

Also, object storage is a great way to expand capacity for queueing systems
instead of using oversized instances. We either write to Kafka or fall back to
writing files to S3 across different buckets and providers.
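
A rough Go sketch of that write path (the Sink interface and the ordering are
hypothetical stand-ins for real Kafka/S3 clients):

    package sink

    import (
        "context"
        "errors"
    )

    // Sink is a stand-in for a Kafka producer or an S3 bucket writer.
    type Sink interface {
        Write(ctx context.Context, batch []byte) error
    }

    // Fallback tries each sink in order, e.g. {kafka, s3BucketA, s3BucketB},
    // returning on the first success.
    type Fallback struct {
        Sinks []Sink
    }

    func (f *Fallback) Write(ctx context.Context, batch []byte) error {
        var errs []error
        for _, s := range f.Sinks {
            err := s.Write(ctx, batch)
            if err == nil {
                return nil
            }
            // Kafka (or a bucket) unavailable: spill to the next option.
            errs = append(errs, err)
        }
        return errors.Join(errs...)
    }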

------
SkyPuncher
Does anybody else find it funny when companies talk about typically private
metrics (like gross margins) very publicly?

------
elesbao
This looks as much like an engineering problem as a learning problem (how to
build systems for the cloud) and a management problem (how to track and
establish better quality across the whole product lifecycle). Nice that they
are still learning their stuff and having fun.

------
winrid
They have a Gross Margin Team??? Wow

------
stefan_
Segment sounds like the kinda business that should presumably just serve up
403 errors for all EU traffic. Data laundering analytics to 300 external tools
is shitting on the GDPR.

(Remember this when reading the article: all the traffic, all the VMs, all the
megadollars spent on AWS here are doing nothing but tunnel ( _replicate_ )
analytics data to third-parties, all of whom would be perfectly happy to
receive it directly. It is the definition of waste.)

~~~
patio11
Among other reasons to architect it this way, having the client (web browser)
connect to each analytics provider directly pushes the work to the least
reliable, most network-constrained, and least manageable node in the network.
Segment lets you have the client do de minimis work and have the heavy duty
transfer (and retries, etc) happen from somewhere in AWS, where they're not
connected over a 3G connection. That isn't waste, contingent on the company or
the user getting value out of analytics and analytics-driven decisionmaking,
which is quite plausible.

~~~
cellularmitosis
Not to mention that many businesses have multiple client platforms (web, iOS,
Android, etc.), so implementing anything client-side immediately multiplies
the dev spend.

------
allard
A $10m problem is 1,000,000,000× smaller than a $10M problem.

------
z3t4
It's a bit silly that moving something from one app to another is so
complicated.

------
lifeisstillgood
>>> As a concrete example: a single Salesforce server supports thousands or
millions of users, since each user generates a handful of requests per second.
A single Segment container, on the other hand, has to process thousands of
messages per second–all of which may come from a single customer.

This sounds like the basic problem with Big Data and selling advertising as a
business model ... that eventually even bits aren't free.

I can see how it happens - but I think any business that has "ship everything
to our servers in San Francisco" at its core is just badly architected - and
if that's your business model, you have a bad business model.

No particular comment on Segment, but a general thought - perhaps most of the
business models today are not very good ones.

(I seem to remember a rap-lyrics startup that spun up a new single-threaded
Ruby on Rails instance for the most trivial request increases.)

