
A million-dollar engineering problem - gwintrob
https://segment.com/blog/the-million-dollar-eng-problem/
======
Negitivefrags
A friend of mine was annoyed that a small service he liked was shutting down.

He contacted the developer who said that they were shutting it down because
the server costs were higher than the money they were making.

They were spending 5k a month on AWS crap and claimed it was impossible to get
any lower.

He helped them consolidate everything onto a single rented dedicated server
costing 400 a month. Now the service is profitable, and will stay up.

It runs _way_ faster on the single server, and it has required less
maintenance since the move, too.

This kind of shit is everywhere. At this point simply not using AWS is a
competitive advantage.

~~~
QuinnyPig
> At this point simply not using AWS is a competitive advantage.

Respectfully, I'm going to disagree. I consult full time on AWS cost
optimization / reduction / understanding.

If you blindly run things on AWS without an understanding of the costing
model, that'll work for a time. As you scale, you start to realize "oh my god
it runs on money."

There are myriad ways around that, but a blanket "never use AWS" isn't going
to address it constructively.

~~~
kafkaesq
_A blanket "never use AWS" isn't going to address it constructively._

But that's not what the commenter was saying, of course.

More like (paraphrasing) "before you go whole-hog on AWS, do some simple math
first." Which an amazingly high percentage of people neglect to do these
days.

~~~
hinkley
Weeks of work can save hours of meetings.

~~~
thecrazyone
interesting thing to say, but did you plan for it to mean something in this
context? :)

~~~
kafkaesq
Nothing special - just the whole "Yeah, saw your email† ... look, I'm so busy,
did you get the site up and running again? I'll get back to you on this after
my vacation" management style which seems to drive a great many "decisions"
that are made in this realm. And on which AWS, implicitly, banks a great deal
of its business model.

† About the monthly you-know-what bill, now well into 5 figures and climbing.

------
ig1
However, they miss the easiest fix: calling their account rep at AWS and
cutting a deal.

AWS loves startups that could end up being huge customers so they're willing
to slash bills upfront to help you get to growth stage; not only will they
assign you an account rep but they'll have a rep whose job it is to build a
good relationship with your VC. Speak to your account rep. Have your VC speak
to their Amazon rep. Push them to cut your prices.

I've known startups which have cut their monthly bill by six figures going
this route. AWS doesn't want your startup to fail, and on the margin these
resources cost them close to zero, so they're willing to deal. Make use of it.

~~~
hueving
Gross. AWS is becoming the Cisco of cloud computing. Get people hooked when
they are low revenue and twist the screws in later when they get big.

~~~
edanm
In what way is this gross?

They're giving people what is essentially a loan, which they are uniquely able
to give, to help them start a business, in the hopes that this will be an even
bigger business. This is almost the definition of a win-win.

Customers are free to leave at any time or work with other providers, but they
choose AWS because they give them more value.

~~~
Curious42
Wouldn't it irritate AWS if startups left once they got big, despite receiving
special treatment from AWS? How do they deal with that?

~~~
pc86
> _How do they deal with that?_

By pricing it into the discounts in the first place (so, smaller discounts),
and by making sure their offerings are priced competitively with comparable
products.

------
kornish
The Dynamo incident highlights an important lesson when using consistently
hashed distributed data stores: make sure the actual distribution of hash keys
mirrors the expected distribution. (Though to their credit, someone writing an
automated test with a hard-coded key was beyond their control.)

Incidents like this are generally why rate limits exist, which they don't
currently have [0], but perhaps they'll consider putting a burst limiter in
place to dissuade automated tests but not organic human load spikes.
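
A burst limiter like that is usually a token bucket: short organic spikes can
draw down the burst headroom, while sustained automated load gets throttled.
A rough sketch, with made-up parameters (nothing Segment-specific):

```python
import time

class TokenBucket:
    """Burst limiter: allows spikes up to `capacity` requests at once,
    but caps sustained load at `rate` requests per second."""

    def __init__(self, rate, capacity):
        self.rate = rate              # steady-state tokens refilled per second
        self.capacity = capacity      # burst headroom
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # refill proportionally to elapsed time, never above capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False                  # caller would return HTTP 429 here
```

A human traffic spike of `capacity` requests sails through; a load test that
hammers the endpoint continuously burns the headroom and then gets throttled
at `rate`.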

Unfortunately there doesn't seem to be an easy way to fix the per-user ID
write bottleneck, short of adding a rate limit to the API – which would push
backpressure from Dynamo to the Segment API consumer. Round-robin partitioning
of values would fix the write bottleneck, but has heavy read costs because you
have to query all partitions. They undoubtedly performed such analysis and
found that it didn't fit their desired tradeoffs :)
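
To make the tradeoff concrete, here's a toy sketch (my own illustration, not
Segment's code) of spreading a hot key across a fixed number of sub-keys:
writes round-robin across shards, but every read has to fan out to all of
them:

```python
import collections
import itertools

NUM_SHARDS = 8  # assumption: fixed shard count per logical key

class ShardedCounterStore:
    """Writes to a hot key are spread across NUM_SHARDS physical keys
    (no single hot partition), but a read costs NUM_SHARDS ops instead of 1."""

    def __init__(self):
        self.kv = collections.defaultdict(int)       # stand-in for the real table
        self._rr = itertools.cycle(range(NUM_SHARDS))

    def write(self, key, amount=1):
        shard = next(self._rr)                       # round-robin shard choice
        self.kv[f"{key}#{shard}"] += amount          # one write op

    def read(self, key):
        # the read penalty: touch every shard and combine
        return sum(self.kv.get(f"{key}#{s}", 0) for s in range(NUM_SHARDS))
```

With 100 writes to one logical key, no physical key sees more than 13 of them,
at the cost of an 8-way fan-out on read.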

Great post, very informative. Thanks for sharing! Also, love the slight irony
of loading AWS log data into an AWS product (Redshift) to find cost centers.

[0]: [https://segment.com/docs/sources/server/http/#rate-limits](https://segment.com/docs/sources/server/http/#rate-limits)

~~~
rbranson
Thanks for the warm feedback!

We currently have an internal project underway to detect hot keys in our
pipeline and collapse back-to-back writes before they're written out to
DynamoDB.
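
A minimal sketch of the coalescing idea, last-write-wins within a batch
(purely illustrative; the real pipeline is more involved):

```python
class WriteCoalescer:
    """Collapse back-to-back writes to the same key into a single upsert,
    flushed as one batch to the backing store (e.g. a DynamoDB table)."""

    def __init__(self, store):
        self.store = store    # stand-in for the real table: a plain dict here
        self.pending = {}     # key -> latest value seen in this batch

    def write(self, key, value):
        self.pending[key] = value   # later writes replace earlier ones

    def flush(self):
        issued = len(self.pending)  # downstream write ops actually performed
        self.store.update(self.pending)
        self.pending.clear()
        return issued
```

A thousand back-to-back writes to one hot key collapse into a single
downstream write per flush interval.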

It's difficult to apply throttling on these conditions synchronously within
the ingestion API (i.e. return a 429 based on too many writes to one key)
because of the flexibility of the product: that workload is perfectly
acceptable for some downstream partners. It also gives me pause from a
reliability perspective. We try to keep our ingestion endpoints as simple as
possible to avoid outages, which for our product means data loss.

~~~
kornish
Ah, gotcha. Yeah, it makes sense to avoid synchronously turning away data, as
that would defeat the point of the product. And the cost of wrongly rejecting
data is high, because the moment when a client is receiving lots of data is
when it's most important for them to store it.

If you don't mind answering: for your warehouse offering, do you pull data
from some services (e.g. Zendesk, SFDC), have them push it to you (which is
what I interpreted your "downstream partners" comment to mean – though perhaps
those are "upstream partners"), or a mix of both?

~~~
rbranson
By downstream partners I mean data flows from user -> Segment -> partner.
Event data is ingested through our API & SDKs, and this is fed to partners'
APIs. For Cloud Sources, data is generally pulled from partners using their
public APIs at an interval and pushed to customer warehouses in batches. In a
few special cases partners push data to our APIs.

------
bearjaws
I work at a startup, we own all our own hardware, and it is HELL.

We are forced to pay extremely large sums of money to upgrade our
infrastructure, since any purchase requires a redundant piece as well.

For example, we have used 90% of our SAN's storage, our IO is suffering, and
now we're looking at purchasing two $10k SANs to upgrade. In the meantime, we
have probably spent over $10k worth of development time to compress, clean up,
and remove any data we don't need. We will not get the weeks of development
effort back, we will continue to spend more money on hardware, and we will
continue to suffer delayed projects because of poor choices.

If we had AWS, we could have saved an entire sprint's worth of work (also
~$14k worth of IT time) and moved on with our lives.

Clearly, some companies let the budget keep growing, but for most companies
AWS flexibility is worth its weight in gold.

~~~
ryeguy
Why is it always colo vs cloud in these arguments? What about renting
everything monthly from ovh, rackspace, softlayer etc? Many of them can
provision dedicated servers within minutes, too.

~~~
flukus
Also, some combination of cloud and private servers. If I were starting a
Netflix competitor, for instance, the site could probably run fine from owned
hardware, but the content itself would be better off cloud-based.

~~~
ryeguy
Ironically, Netflix does the exact opposite. Also, you definitely don't want
to have the content in the cloud - data transfer pricing is where they make
significant profit. Both AWS and GCE charge ~$100/TB outgoing.
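
At that ~$100/TB figure, content-delivery math gets ugly fast. A
back-of-the-envelope sketch (all the usage numbers are hypothetical):

```python
def egress_cost_usd(tb_out, usd_per_tb=100.0):
    """Monthly egress bill at the ~$100/TB rate cited above."""
    return tb_out * usd_per_tb

# Hypothetical small video service: 10,000 users x 20 viewing hours/month
# x 2 GB per hour of video = 400 TB out per month.
tb_per_month = 10_000 * 20 * 2 / 1000
print(egress_cost_usd(tb_per_month))   # 40000.0 -> $40k/month in transfer alone
```

Which is why a CDN, or the flat-rate bandwidth a dedicated host includes,
changes the economics so much for content-heavy workloads.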

------
teh
I've been joking with friends that my next job will be AWS efficiency guru.
I've somewhat optimized our own use, but I think I could use similar, simple
rules to get 20% out of a 500k / month budget.

Give me what I save you in 2 months and I'll have a good business :)

~~~
aresant
Go do it!

I used that exact same model in Conversion Rate Optimization - get your
conversion rate up, give me 30% of what we improve.

And built that into a 20+ person digital agency billing millions of dollars a
year before being bought out.

Exactly how I did that, and you can too:

(1) Wrote topical, detail rich posts similar to the parent here about problems
I was solving in CRO for a handful of customers, never disclosing confidential
customer info.

(2) Marketed those posts strategically. EG I wrote one about "Which trust
symbol gives you the highest return on conversion rate?" and then literally
just bought Google AdWords for people searching for that question!
StackOverflow and other forums are also great ways to market, by answering
questions (free + put your details in contact info) or running ad campaigns
specifically on those topics ($5k+).

(3) Turned the best performing / most viewed posts into "pitches" for speaking
gigs at materially similar conferences, most were accepted and I became an
"authority".

Every post / conference / etc had a little "Want us to fix it for you? Full
service, performance fee model." banner or mention.

Work poured in after that and we were lucky enough to be very choosy.

If you can SAVE large enterprises money and are willing to do it on a
performance basis you've got a business.

~~~
omarchowdhury
That's awesome, do you have more information about your acquisition?

~~~
aresant
This was several years ago before we had the "growth hackers" lexicon.

One of our customers bought the entire company to get a hold of the core team
+ essentially continued the "30%" deal as a long term incentive as convertible
equity.

Was a good run, and it reinforces my point: if you can bring measurable /
substantial value to large enterprise companies, amazing things can happen.

Large enterprises have fun accounting terms like "capitalizing an
acquisition", eg they don't buy you out of cash flow. They can even carry debt
on the purchase / incentive programs etc. that not only makes you more
valuable to them, but creates incentives for them to buy smaller companies.

Happy to answer any more specific questions.

~~~
omarchowdhury
What would you say is the most important, specific skill required in
conversion rate optimization?

Is it copywriting? Is it UI/UX? Is it analytics?

Or would it be fairer to say CRO is holistic, requiring a multitude of skills?

~~~
aresant
Great question - by the end of the process we were experts at the holistic
level.

But we started with very basic web design skills.

From the start, the key "skill" for the pitch / our clients was that we had
the BANDWIDTH to focus on implementing testing as a regular / rigorous process
for them.

EG most companies - even big cos - wind up with their good web dev resources
completely tied up with 100 other things.

And 10 departments competing for their attention.

I came in and said we can do this, full service and you just have to approve
the tests and give us a dev environment.

But we started with just basic web design skills and a willingness to read the
multitude of good resources / books / blogs / thought leaders on CRO and apply
their ideas.

EG - Common CRO theme - "try different, high-contrast button colors on your
call to action"

Ok, I can do that.

- Take existing page

- set up Optimizely or VWO (at the time we used home brew or GoogleWO!)

- make some really great buttons (or outsource on Elance for $10)

- get client buy-in

- set up the test

- ensure production readiness / testing

- go live

- provide a nice weekly reporting format for the client that lets them stay
involved, see results, and have confidence in your ability to execute.

- prepare the next test while the first one is running, and remember that a
huge % of your best ideas will fail; be agnostic to results, be statistically
honest, and educate clients under the same process / instruction

Simple example, but you can ramp the complexity up from there.

Like "better to have this "sign up now" button go to another page w/form or
pop a modal window w/form?"

and on down the list, the CRO blogs / experts / etc have 1000 ideas.

And over time we got better at understanding what did / didn't work across
clients.

So my win rate crept up from, say, 20% of tests being a significant win to
50%+ out of the gate.

And every person on my team I hired because we wanted to run a test that I /
we couldn't implement ourselves :)

What's your specific interest / skill set that you're trying to adapt over?

~~~
jackgolding
I have an analytics background and have been applying for multiple CRO jobs.
The breadth of skills people want the in-house CRO guy to have is very
frustrating (e.g. I've had interview questions on JS frameworks, very advanced
stats, non-trivial SQL, CMS systems, Photoshop, SEO, paid search, Salesforce
...), especially when I know whole agencies that couldn't tick all those
boxes.

Would love to start freelancing this stuff, but it seems like it needs so much
personal branding.

------
gravypod
I've always been taken aback by how much AWS servers cost. Maybe I'm just too
cheap, but you can go to very reputable hosting companies and get things at a
fraction of the price.

For example, if I need 16GB of memory and 4 cores, here are my options:

    
        * AWS (t2.xlarge) $137.62
        * OVH (EG-16)     $79
        * So You Start/OVH (SYS-IP-1S) $42
        * Kimsufi/OVH (KS-3B) $29
        * Hetzner (EX41) €46.41 [Lowest-cost model has 32GB of RAM]
        * Joe's Data Center (no model name) $45 [Dual-socket L5420, which is 8 cores]
    

That's a crazy price difference and, unlike AWS as I understand it, you don't
need to pay for bandwidth with most of these companies, or they have some
crazy high limit. IIRC it's common for 20TB to be the standard (FREE)
bandwidth allowance. You're also on real hardware, not a VM.

Unfortunately not all of these are perfect systems. There are issues with
them: some have slower CPUs, some have slower disks, some are in other
countries. But you can just pick the ones that suit whatever you have to run.
You can also use VMs that are far cheaper than these dedicated systems. If
your workload is under a 50% duty cycle on a CPU and under 4GB of memory, you
don't need a dedicated server. Buy a VM from one of these companies:

    
        * RamNode (8GB SVZS) $28 [Full SSD storage, very fast network]
        * OVH (VPS CLOUD RAM 2) $22.39 [12GB(!) of RAM and 2 vCores]
        * Linode (Linode 4GB) $20
    

These are all VPSs but they will get the job done and are cheap enough that
you can build something great on a budget.

This is just from a few minutes of remembering back to when I had to buy my
own hardware; I'm sure this isn't an exhaustive list. You can usually find
information on the Web Hosting Talk forums and on a site called LowEndBox. I
don't have links on hand, but they're worth a read.

~~~
djchen
Most of the providers you mentioned aren't as reliable and scalable as AWS,
Google Cloud, Azure, etc. That isn't an apples-to-apples comparison.

I would not want to host my business on So You Start, Kimsufi, or Hetzner, and
especially not Joe's Data Center. I have personally used Joe's DC and they
have had numerous outages in the past. Hetzner is known for terminating you
for any sort of "DoS"-like traffic, including high PPS. On top of that, those
providers (except maybe OVH?) don't let you scale up and down with automated
APIs.

~~~
mattmanser
Reliability-wise, do you actually use these services? Because in my experience
that's a complete myth.
One of my clients on a dedicated server has never gone down. The site is
blazingly fast, barely touches 5% CPU and pages have sub 50ms response times.
Deploys take 10 seconds or so, I could make it faster but it's not really
worth the cost/benefit.

My client on Azure, with a significantly lower visitor count, pays 3-10 times
as much. The site is sluggish, takes a long time to spin up after deploys, and
hangs occasionally; we've caught the whole site being offline a couple of
times, then it mysteriously starts working again with nothing in the logs;
we've had a deploy to one site take down other sites; and on top of that
there's a 3-5ms delay between the database and website which causes all sorts
of performance problems when a page makes too many DB requests.

They've had to "scale up" to premium database in the past to handle loads I
_know_ a much cheaper dedicated server would have handled without even
thinking about it.

On top of that, Azure's management portal is super slow, regularly fails to
execute commands and is incredibly frustrating to navigate and use.

The claimed machines you get in the cloud aren't anything like as performant
as supposedly similar dedicated machines.

I'll admit I've generally found AWS to be significantly better than Azure, but
still very expensive.

And AWS went down a couple of weeks ago too.

~~~
gravypod
> on top of that there's a 3-5ms delay between the database and website which
> causes all sorts of performance problems

Yeah, that's something I don't get either. With all of that IPC going over
Amazon's network for all of their services, how much time are programs wasting
sending and waiting for messages from other Amazon services?

I never really thought about that. If someone could measure this it would be
interesting.

~~~
rbranson
I can't comment on Azure, but most of the complaints about networking issues
in AWS went away when they redesigned their networks around an SDN model with
VPC. We see consistent <1ms RTTs within region, and <300µs within availability
zone (datacenter).

~~~
mattmanser
I once got a reply from Google Compute, after complaining about how bad Azure
was, saying they'd had similar problems but solved the latency between servers
and databases. It basically just sounds like the Azure guys have severely
mucked up their infrastructure.

We have a test page that runs 100 "SELECT 1" SQL queries _ON THE SAME
CONNECTION_ between an Azure website and an Azure database, and it takes a
whopping 250ms to complete at best. That's literally all it does. And their
infrastructure is so bad that it'll vary from 250ms to 600ms in the space of
20 seconds in peak periods. That's damning; cloud infrastructures are there to
part you from your cash, nothing more.

It's ridiculously bad; in my opinion no professional should _EVER_ recommend
Azure. My client is enamoured with telling people the system runs on Azure,
with its "secure" network - nothing I've told him makes a difference, but he'd
have a much better service if he moved it off Azure.

Because we develop with Entity Framework, it's quite easy to muck up and
forget an include or two, and then suddenly the page spawns one or two hundred
basic SELECT queries to populate an `Order.OrderItem` or something as trivial.
On a standard dedicated server setup, or even your worst 5-year-old crappy dev
laptop, that's a ms or two extra, but on Azure it's performance death.
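
The arithmetic behind that "performance death" is simple: a chatty page pays
the network round trip once per query. A quick illustration (latency numbers
hypothetical):

```python
def page_db_time_ms(n_queries, rtt_ms):
    """Total time a page spends waiting on the database when each
    query is a separate round trip on the same connection."""
    return n_queries * rtt_ms

# 200 accidental SELECTs from a forgotten .Include():
print(page_db_time_ms(200, 0.3))   # 60.0 ms on a sub-millisecond LAN
print(page_db_time_ms(200, 2.5))   # 500.0 ms at the latencies described above
```

Same ORM mistake, roughly an order of magnitude difference in page time, which
is why the per-query latency of the hosting environment matters so much.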

There are advantages to using Azure, I admit - the easy deploy from GitHub,
for example - but that's more because setting up deployment from a repo to IIS
is such a bad workflow at the moment, and IIS management too. They don't
_want_ to make it easy. With other clients I've set up moderately complicated
deployment scripts, and once the initial work is done, it's much better than
Azure: you run a bat file and boom, deployed, much faster than Azure manages -
10-20 seconds without even pre-compiling the pages. Azure will take 5 or 10
minutes to finally get round to doing it, and woe betide you doing it at peak
times, as it will use up your memory allocation and simply hang for 20 minutes
while you frantically try to restart it and the Azure management portal has a
massive fit. (Yes, we unfortunately deployed a serious bug, fixed it, tried to
re-deploy while under heavy load, cue website down for a ridiculous amount of
time while their management portal threw a ton of weird and inscrutable errors
before we finally managed to restart it.)

~~~
rbranson
Yikes. I've only really had true production experience with AWS. I guess all
clouds aren't made the same.

------
jamiesonbecker
Great writeup. The "user_id" one really hit home for me. At Userify (SSH key
management for on-prem and cloud), we currently have hotspots where some
companies integrate the Userify shim and put 'change_me' (or similar) in their
API ID fields. Apparently, they don't always update it before putting it into
production... so we get lots and lots of "change_me" attempted logins! It's
not just one company, but _dozens_.

Fortunately, we cache everything (including failures) with Redis, so the
actual cost is tiny at most, but if you are not caching failures as well as
successes, this can result in unexpected and really hard to track down cost
spikes. (disclaimer: AWS cert SA, AWS partner)
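
Concretely, "caching failures" can be as simple as remembering a failed lookup
for a short TTL. A minimal sketch (illustrative only; our real implementation
sits on Redis and is more involved):

```python
import time

class NegativeCache:
    """Cache lookup failures as well as successes, so repeated bad
    credentials (e.g. 'change_me') never reach the backing database."""

    MISS_TTL = 60.0   # assumption: how long to remember any result, seconds

    def __init__(self, backend_lookup):
        self.backend_lookup = backend_lookup   # expensive call, e.g. DynamoDB
        self.cache = {}                        # api_id -> (expires_at, result)

    def lookup(self, api_id):
        hit = self.cache.get(api_id)
        if hit and hit[0] > time.monotonic():
            return hit[1]                      # cached success *or* failure
        result = self.backend_lookup(api_id)   # may be None for a bad API ID
        self.cache[api_id] = (time.monotonic() + self.MISS_TTL, result)
        return result
```

Without the negative entries, every "change_me" login attempt would be a cache
miss and hit the database, which is exactly the surprise cost spike described
above.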

Segment's trick of detecting and recording throttling, and using that as a
template for "bad keys" (which presumably are manually validated as well),
seems like a great idea too, but I'd suggest first caching even failed login
calls if possible, as that probably would have mitigated the need to ever hit
Dynamo.

PS: the name 'Project Benjamin' for the cost-cutting effort... pure genius.

------
dastbe
Reading through this, I would change "Dynamo is Amazon’s hosted version of
Cassandra" to "DynamoDB is Amazon's hosted key-value store, similar to
Cassandra". The former (to me) sounds like you're saying they vend a managed
Cassandra.

~~~
rbranson
You are totally right. This was an oversight that we didn't catch during our
review process. Thanks so much for the tip, we've updated the post.

------
fourstar
I wish I could invest in Segment. Lot of smart people over there.

~~~
eastbayjake
These last two engineering blog posts[1] were great reads and they're doing
good open source work. Not looking for a job but would love to meet the team
if they host an event!

[1] [https://segment.com/blog/ui-testing-with-nightmare/](https://segment.com/blog/ui-testing-with-nightmare/)

------
julienba
DISCLAIMER: This can be viewed as an advertisement for my project.

A few months back I started a company for that exact purpose: helping you
decrease your AWS bill. [https://wizardly.eu/](https://wizardly.eu/)

The focus is on reservations, unused resources, and tracking cost evolution on
a day-to-day basis.

I haven't gotten much traction (I'm not a marketer and it's my first SaaS
project; still a lot to learn in that area). I'm looking for more user
feedback - if you're willing to invest a few minutes in my project, that would
be really helpful. It's also completely free right now.

As the article says, there is no silver bullet, especially if you need
infrastructure changes. But it's possible to get an interesting cut by using
all the AWS pricing options.

Making teams aware of and responsible for their costs also makes a big
difference. Often, the people creating and managing infrastructure have no
view of what it is costing their company.

------
nik736
Have you guys considered going bare metal or a hybrid approach? With such
immense spending (even after saving the $1m/yr) it would probably be a lot
cheaper.

~~~
rbranson
It could be if our workload was relatively stable and there were spare
engineering cycles to undertake a migration and all that this entailed.
Neither of these is the case.

Much of what allowed us to implement these savings quickly with a small team
was the flexibility afforded by cloud infrastructure. Poor decisions are easy
to reverse, but in a bare metal world you better be damn sure what you're
doing, which slows down the decision-making process and seriously complicates
experimentation. The number of people who know how to build out datacenters at
the scale of thousands of machines is vanishingly small.

We'd also need to replace IaaS services like ECS, ELB, ElastiCache, RDS, and
DynamoDB. There are certainly off-the-shelf replacements, but we'd need to
build-out the expertise within our teams to operate these systems. We're
talking roughly a dozen or so engineers working full time for many, many
months to get these systems in place from scratch, on top of the even larger
effort to design and build out datacenters. I'd much rather plow those cycles
into efforts like expanding to multiple regions and improving reliability of
internal services. That's a much better return on investment for our
customers.

Right now we're in the sweet spot for the cloud. We're way too big to run on
trivial amounts of hardware, growing at a rate that makes it difficult to stay
ahead of demand in a datacenter-centric world, and too small to justify
investing in a scalable hardware build-out.

~~~
patrickg_zill
I would bet you 1000 USD that it would be eye-opening to buy a 64gb RAM, 8
core CPU server, put 4 SSDs in it in RAID10 config, and test your entire stack
on just that box using Docker, KVM, whatever. Just running it beside your desk
and having your other devs beat on it.

You might have to stub out some things using e.g. Cassandra or another open
source alternative; just use a Docker or VM image without the production
redundancy.

You did admit that you have no idea how much performance is being dropped on
the floor by not using dedicated hardware. So why not test it?

My belief: you would find you are paying AWS far more than they are actually
worth.

~~~
rbranson
I've operated on-prem and colo physical data centers. I've also been part of
larger companies where someone else deploys baremetal on my behalf. I've been
running infrastructure on top of AWS off-and-on since 2008. I've done a live
migration of one of the largest Internet sites from AWS to baremetal in data
centers. I feel pretty good about keeping our infrastructure in AWS for the
foreseeable future.

~~~
patrickg_zill
I apologize; I didn't read that part of your bio on the site.

Your post seemed to indicate that you hadn't done any calculations for the
bare metal server with your own vms scenario. But clearly you have, if only
mentally.

~~~
rbranson
No worries! I'm not super into self-promotion, but I do think my background is
relevant in this case. Either way, the decisions that go into a strategy like
this involve a mountain of variables.

------
user5994461
Addendum missing from the article: Segment charges per request, so these users
who run performance testing with analytics enabled should be very profitable.

Business rule => you gotta optimize for your cash cows. ;)

~~~
tejasmanohar
Actually, our pricing is no longer based on API calls--
[https://segment.com/pricing](https://segment.com/pricing).

~~~
user5994461
$10 / 1000 MTUs

"Monthly Tracked User, or the sum of logged in plus anonymous users that
access your product in a given month. "

P.S. I understand that you have volume discounts.

------
CobrastanJorji
> After a three months of focused work, we managed to cut our AWS bill by over
> one million dollars annually.

That's great, but now I'm wondering how many engineers were paid for three
months of full-time work to save $1 million/year on AWS, and how much it cost
the company to pay those engineers for those 3 months.

It's also interesting to wonder how much growth the startup could have managed
with three additional months of work from those engineers.

~~~
grkvlt
If you assume USD 250k per engineer, they would need to have had more than 16
engineers working on this for the 3 months to spend more than a million on
salary. So, I'm guessing they probably came out ahead on that...?
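
Rough math, for anyone following along (the USD 250k fully-loaded figure and
the 3 months are the assumptions above):

```python
def breakeven_engineers(annual_savings, fully_loaded_salary, project_months):
    """Headcount at which the project's salary cost equals the annual savings."""
    cost_per_engineer = fully_loaded_salary * project_months / 12
    return annual_savings / cost_per_engineer

print(breakeven_engineers(1_000_000, 250_000, 3))   # 16.0
```

And since the savings recur every year while the engineering cost is one-off,
the real break-even is far more favorable than 16 engineers.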

------
nemothekid
>The ALB allows ECS to set ports dynamically for individual containers, and
then pack as many containers as can fit onto a given instance.

Maybe I have a different view, but I've been avoiding running copies of the
same service on the same instance. If I'm running multiple copies of a
service, I'm doing it for HA, and if an instance goes down, you could lose all
of them at once. Why would you run the same service twice on the same
instance?

------
dangoldin
Really interesting read. We've been in a similar mindset, and it really comes
down to understanding what's driving your costs and being creative about ways
to minimize them. As others here mentioned, bandwidth and data transfer are a
huge cost, so if there's a way to make your caching more aggressive, it's a
big and easy win. Beyond that, the low-hanging fruit is reserved instances -
there are a few options there, but they are significantly cheaper than paying
the on-demand price.

A shameless plug: I built a small visualization for the AWS detailed billing
log to help understand the cost drivers. All it does is load the data and
visualize it across a few dimensions, but it might come in handy for others -
[https://github.com/dangoldin/aws-billing-details-analysis](https://github.com/dangoldin/aws-billing-details-analysis)

------
marknadal
Great article. I wish/hope/expect everybody to take this type of
work/attention as seriously as they have. I am as much of a penny pincher as
you can get, so our team built a database adapter on AWS that can handle 100M+
records for $10/day; demo here:
[https://www.youtube.com/watch?v=x_WqBuEA7s8](https://www.youtube.com/watch?v=x_WqBuEA7s8).

I'm surprised more people aren't price/performance sensitive. My goal is to
get a system up to 10K table inserts/second on a free Heroku instance; we're
already doing 1,750 table inserts/second on low-end hardware across a
federated system. Price tuning is one of the most important things when
engineering for high load, as well.

Glad to see articles like this. :)

------
pascalxus
Umm... I take issue with the comment "distribution is essentially free".
Nonononnononooooo. Other than the rare case of virality, I'd like to know in
what software industry distribution is essentially free - I'm happy to be
wrong about this, let me know.

The rest of the article was fine.

------
wazoox
A friend of mine developed something to do just that: understand
AWS/Google/Azure costs and compare prices (and save big).

Shameless plug: [https://trackit.io/](https://trackit.io/)

~~~
kierenj
Probably not the ideal response, sorry, but I only got as far as the homepage
- really slow, jerky scrolling, scroll-to-anchors miss the mark, and the
social media links don't work. Then, mouse wheel/trackpad scrolling stopped
working. :/

~~~
wazoox
The website is probably hosted on AWS :) As for the social media links, they
all work for me...

------
burntwater
Had an interesting experience when I went to the pricing page for Segment: a
little sales guy popped up in the lower right corner asking "Would (my
enterprise employer name) like to know if you qualify for a volume discount?"

I only recently started working (and web browsing) from an enterprise network,
so maybe this has been done before, but it's my first time seeing it and I
thought it was really interesting.

 _Edit:_ To be clear, I'm not intrigued by the little guy himself, I'm
intrigued by how personalized it was.

~~~
kornish
Did the box have a "Powered by x" statement anywhere? Would be curious to know
if it's Intercom or something else.

~~~
dcancel
It's Drift (www.drift.com). I'm the CEO. We work closely with Segment and
others including Clearbit to personalize the experience.

------
officelineback
Is this the same thing as segment.io?

~~~
kornish
Yes; they changed their name to Segment at the Series A.

------
Dannymetconan
A great writeup. I really wish they had given some indication of how much the
original bill was. Sounds very effective, but $1m out of $100m isn't that
much!

------
rumcajz
Given that the Elite game [1] used to run on a Sinclair with a Z80 processor
and 48kB of memory, any complaints about costs nowadays sound ridiculous. If
our software infrastructure were coded by Elite programmers, we would be able
to run it at 0.01% of the cost.

[1]
[https://en.wikipedia.org/wiki/Elite_(video_game)](https://en.wikipedia.org/wiki/Elite_\(video_game\))

~~~
jaggederest
> It was written in machine code using assembly language, giving much care to
> maximum compactness of code.

From the link. It seems unlikely that people would/could build applications in
the modern world from scratch in assembler. Start with reimplementing the
entire HTTP parsing engine and go from there? No thanks.

------
divbit
Idk how common the setup in the article is, but seems they could easily test
for heatmap singularities such as the one that happened and alert for that.

~~~
divbit
(just alert any keys with heat > mean +- x *standard deviation or something
stupid like that)

~~~
divbit
Actually if you have 3d features, could use edge detection algorithm like
Kirsch after fuzzing the colors a bit.

------
abalone
_> In some cases, we’re currently packing 100-200 containers per instance._

I have a very basic "Docker 101" question here... Can someone explain how you
might get to >100 containers on a given instance? That seems to imply that
there are multiple containers of the same microservice running on one machine.
Why would you pack duplicate containers like that instead of using one larger
container?

~~~
eastbayjake
> Why would you pack duplicate containers like that instead of using one
> larger container?

Microservices that are scaling horizontally to handle concurrent requests,
etc.

Depending on how seriously your architects take the "micro" in microservice,
you may easily get to 100+ unique microservices/processes each running in
their own Docker container.

~~~
abalone
That's not what's going on here (100-200 unique microservices). Aside from the
fact that _that's insane_, they specifically state they want to "pack"
multiple containers of the same service on one machine:

 _" You cannot run two containers of the same service on a single host because
they will collide and attempt to listen on the same port. (no bin packing)"_

So, what's the sense in doing that instead of one bigger container?
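
For what it's worth, the "bin packing" the article mentions is just the
scheduler fitting many small CPU/memory reservations onto hosts; once each
container gets a dynamically mapped port, duplicates of the same service can
share a machine. A first-fit sketch of the idea (illustrative only, not
ECS's actual placement algorithm; the capacities are made up):

```python
def pack(containers, host_cpu=4096, host_mem=16384):
    """Place (cpu, mem) reservations first-fit onto hosts; return the hosts."""
    hosts = []  # each host tracks remaining capacity and placed tasks
    for cpu, mem in containers:
        for h in hosts:
            if h["cpu"] >= cpu and h["mem"] >= mem:
                h["cpu"] -= cpu
                h["mem"] -= mem
                h["tasks"].append((cpu, mem))
                break
        else:  # no existing host has room: start a new one
            hosts.append({"cpu": host_cpu - cpu, "mem": host_mem - mem,
                          "tasks": [(cpu, mem)]})
    return hosts

# 150 copies of a small service (16 CPU units, 100 MiB each) fit on one host:
hosts = pack([(16, 100)] * 150)
print(len(hosts), len(hosts[0]["tasks"]))  # 1 150
```

Many small duplicates give the scheduler finer-grained pieces to fill gaps
with than one big container would, which is how utilization goes up.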

~~~
biot
They're reimplementing the actor model using Docker containers. Let it crash.

------
ryanmarsh
> blocking any new discovered badly behaved keys. Eventually we reduced
> capacity by 4x.

I wonder how it felt when they discovered how much money they'd burned on such
a dumb little bug. I'd feel like throwing up.

------
stymaar
I don't know if LBO funds are doing this already but I think it's most likely
going to become a trend in the future:

\- buy a SaaS company which hosts all its infra on the cloud

\- move everything to bare metal

\- double the company's margin

\- profit

------
londons_explore
If only they'd spent a couple more dollars on hosting for
[https://segment.com](https://segment.com) so it didn't return HTTP 503
errors...

------
WithHighProb
For us the ECS autoscaling feature is way too naive. We need to build our own
controller on some more responsive metrics. The customization support is
pretty bad.

~~~
dastbe
Would you mind saying more about what you had to build, and what you would
have liked to see?

One thing I'd like to see in the future is scaling actions that supported
multipliers, as well as metric math support (scale service to
request_count/acceptable_requests_per_service).
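
That metric-math idea boils down to one division; a toy sketch using the
names from the comment above (the cap and all numbers are invented):

```python
import math

def desired_tasks(request_count, acceptable_requests_per_task,
                  current, max_multiplier=2.0):
    """Target task count, capped at a multiplier of the current fleet size."""
    target = math.ceil(request_count / acceptable_requests_per_task)
    return min(target, math.ceil(current * max_multiplier))

# A 9x traffic spike only doubles the fleet per evaluation, not 9x at once:
print(desired_tasks(90_000, 1_000, current=10))  # 20
```

The multiplier cap is what makes a scaling action proportional to fleet size
rather than a fixed step, which is the feature being asked for.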

~~~
TheHydroImpulse
One thing we experienced at Segment was the fact that we needed to quickly
handle a surge in volume but couldn't overload partners. Essentially we wanted
something that scaled up quickly at first but was pretty conservative after
that.

We settled on constant increases/decreases using queue depth thresholds, but
ideally ECS would support feeding in multiple metrics and doing some basic
math to figure out how much we're currently processing, how long it would
take to flush the queue, and how many more containers we need to drain it in
a timely fashion.
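
That back-of-the-envelope math can be sketched directly; all rates and names
here are invented for illustration, not Segment's actual numbers:

```python
import math

def extra_containers(queue_depth, current_rate, per_container_rate,
                     drain_seconds):
    """How many containers to add so the backlog drains within drain_seconds."""
    required_rate = queue_depth / drain_seconds   # msgs/sec needed overall
    shortfall = required_rate - current_rate      # msgs/sec we're missing
    return max(0, math.ceil(shortfall / per_container_rate))

# 1.2M messages queued, fleet currently drains 500/s, each new container
# adds 50/s, and we want the backlog gone within 10 minutes:
print(extra_containers(1_200_000, 500, 50, 600))  # 30
```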

~~~
cobookman
Have you looked into kubernetes?

~~~
TheHydroImpulse
A bunch of people at Segment have looked into it. We manage our infrastructure
with Terraform and operationally IMO Kubernetes introduces too much complexity
compared with ECS and Terraform.

ECS is pretty dead simple (just run an agent on the host) and while it doesn't
offer nearly the same feature set, it's really good operationally.

Some people here are tinkering with it for some non-core services.

------
cowardlydragon
It is my impression that AWS makes its bank on I/O, while everyone makes
buying decisions on CPU/RAM.

------
gregjw
Dollar dollar bills y'all

------
williamscales
Is it just me or is segment.com down?

------
qwerty2020
how does a bar chart with no y-axis label show any trend? scale could be
$100-101/mo

------
joshua_wold
Can you lower our bill now? :)

~~~
QuinnyPig
My one step solution to bill lowering involves "blackmailing your TAM." Have
you tried that? ;-)

------
dredmorbius
Clickbait title.

------
qaq
Here's an amazing way to optimize your AWS bill: don't use it. Compared to
dedicated you overpaid by millions, then you spent non-trivial developer
time/money to get the number down, but you are still overpaying. At 5K/month
AWS might make sense (although debatable); at your level of spend it's a
really bad idea. At this level an ops team of 2 (1 on-site, 1 remote in a
different time zone) would give you way more flexibility and a very low bill.

