A million-dollar engineering problem (segment.com)
85 points by gwintrob 1 hour ago | 27 comments





A friend of mine was annoyed that a small service he liked was shutting down.

He contacted the developer, who said they were shutting it down because the server costs were higher than the money they were making.

They were spending 5k a month on AWS crap and claimed it was impossible to get any lower.

He helped them consolidate everything onto a single rented dedicated server costing 400 a month. Now the service is profitable, and will stay up.

It runs way faster on the single server. It has also required less maintenance since the move.

This kind of shit is everywhere. At this point simply not using AWS is a competitive advantage.



A lot of people are in denial about the fact that cloud services don't have some magic efficiency that other datacenters lack. The primary cost savings on cloud VMs come from overprovisioning. The more abstracted away the service is from the hardware, the more they can overprovision without customers noticing.

The 50%+ profit margins have to be coming from somewhere. AWS is not made of magic; it's made from largely the same PC parts you buy on Newegg.



> The 50%+ profit margins

Charge more than what it costs you. That's how to make money.



You shouldn't be on AWS in the first place if everything you do can fit on a single server.

Use the right tool for the job.



I have services that do not fit on one server, and I still don't need AWS. Distributing load and dividing services across x-xx servers is not rocket science, especially with the tools we have available at the moment.



Exactly this. I am using dedicated servers everywhere. I would easily pay 10-20x more for the same thing if I used AWS instead.



The Dynamo incident highlights an important lesson when using consistently hashed distributed data stores: make sure the actual distribution of hash keys mirrors the expected distribution (though, to their credit, someone writing an automated test using a hard-coded key was beyond their control).
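
As a rough sketch of what checking that could look like: count writes per partition key over a sample of traffic and flag any key whose share is far above what a uniform spread would give. The function name `hot_keys`, the `threshold` multiplier, and the sample data are all made up for illustration; none of this is from Segment's post.

```python
from collections import Counter


def hot_keys(keys, threshold=10.0):
    """Return {key: share} for keys whose share of traffic exceeds
    `threshold` times the share expected under a uniform distribution."""
    counts = Counter(keys)
    total = sum(counts.values())
    if total == 0:
        return {}
    expected_share = 1.0 / len(counts)  # share per key if traffic were uniform
    return {
        key: count / total
        for key, count in counts.items()
        if count / total > threshold * expected_share
    }


# Hypothetical sample: 100 organic user IDs plus one hard-coded test ID
# that accounts for 900 of the 1000 writes.
sample = [f"user_{i}" for i in range(100)] + ["user_id"] * 900
flagged = hot_keys(sample)
```

Here `flagged` contains only the hard-coded `"user_id"` key, since it carries 90% of the sampled writes.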

Incidents like this are generally why rate limits exist, which they don't currently have [0], but perhaps they'll consider putting a burst limiter in place to dissuade automated tests without blocking organic human load spikes.
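
One common shape for a burst limiter is a token bucket: short spikes up to the bucket size pass, while a sustained flood gets throttled. A minimal sketch follows; the class name and parameters are my own assumptions, and it says nothing about how Segment's API actually works.

```python
import time


class TokenBucket:
    """Allow bursts of up to `capacity` requests, refilled at `rate` tokens/second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock            # injectable so the limiter can be tested
        self.tokens = float(capacity)
        self.last = clock()

    def allow(self, cost=1.0):
        now = self.clock()
        # Refill for the elapsed time, never exceeding the bucket capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

In practice you would keep one bucket per API key or per user ID, so a runaway test with one hard-coded ID only throttles itself.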

Unfortunately there doesn't seem to be an easy way to fix the per-user ID write bottleneck, short of adding a rate limit to the API – which would push backpressure from Dynamo to the Segment API consumer. Round-robin partitioning of values would fix the write bottleneck, but has heavy read costs because you have to query all partitions. They undoubtedly performed such analysis and found that it didn't fit their desired tradeoffs :)
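
To make that tradeoff concrete, here is a toy sketch of round-robin write sharding, with an in-memory dict standing in for the real store. The `#<shard>` key suffix, the shard count, and the class name are assumptions for illustration only.

```python
import itertools
from collections import defaultdict

NUM_SHARDS = 4  # assumed shard count, purely illustrative


class ShardedStore:
    """In-memory stand-in for a partitioned key-value store."""

    def __init__(self, num_shards=NUM_SHARDS):
        self.num_shards = num_shards
        self.partitions = defaultdict(list)  # partition key -> list of values
        self._next = itertools.count()       # round-robin counter

    def write(self, user_id, value):
        # Spread writes for one logical key across num_shards partition keys.
        shard = next(self._next) % self.num_shards
        self.partitions[f"{user_id}#{shard}"].append(value)

    def read(self, user_id):
        # Reads pay the cost: every shard has to be queried and merged.
        values = []
        for shard in range(self.num_shards):
            values.extend(self.partitions.get(f"{user_id}#{shard}", []))
        return values
```

Each write touches one partition, but each read touches all `num_shards` of them, which is the read cost mentioned above.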

Great post, very informative. Thanks for sharing! Also, love the slight irony of loading AWS log data into an AWS product (Redshift) to find cost centers.

[0]: https://segment.com/docs/sources/server/http/#rate-limits



I've been joking with friends that my next job will be AWS efficiency guru. I've somewhat optimized our own use, but I think I could use similar, simple rules to get 20% out of a 500k / month budget.

Give me what I save you in 2 months and I'll have a good business :)



Go do it!

I used that exact same model in Conversion Rate Optimization - get your conversion rate up, give me 30% of what we improve.

And built that into a 20+ person digital agency billing millions of dollars a year before being bought out.

Exactly how I did that, and you can too:

(1) Wrote topical, detail rich posts similar to the parent here about problems I was solving in CRO for a handful of customers, never disclosing confidential customer info.

(2) Marketed those posts strategically. E.g., I wrote one about "Which trust symbol gives you the highest return on conversion rate." and then literally just bought Google AdWords ads targeting people searching for that question! StackOverflow and other forums are also great ways to market, either by answering questions (free, and put your details in your contact info) or by running ad campaigns specifically on those topics ($5k+).

(3) Turned the best performing / most viewed posts into "pitches" for speaking gigs at materially similar conferences, most were accepted and I became an "authority".

Every post / conference / etc had a little "Want us to fix it for you? Full service, performance fee model." banner or mention.

Work literally poured in after that and we were lucky enough to be very choosy.

If you can SAVE large enterprises money and are willing to do it on a performance basis you've got a business.



I'm pretty much that in my current job, only with a salary. Here's why: I can make all the recommendations I want, but change has to be driven by the will of the higher-ups, often as high as C-level folks (CIO/CTO). So you pay me to make recommendations, not to actually save you the money, because the second part is largely out of my control.

Having said that, all the cost savings initiatives I've spearheaded are on my resume and LinkedIn profile and I take great satisfaction in optimizing those environments to save the client money.



The simplest one might be to convince a company to reserve 3 years' worth of AWS resources and pay upfront. I am in this situation right now, and it's a tough pill to swallow.

I decided that all of my personal projects will be on GCE. It is already much more cost-efficient, and Google will soon allow me to commit to future usage and pay for my commitment as I go. (Right now AWS forces you to pay upfront to get the same discount, ~50%.)



If I have to give that to you it's not "savings"... :D



But you do save money every month after that!



2 months? That's cheap. Make it a year.



I don't think that's a bad proposition at all. If I were a business person running on AWS, I'd do it.



Great writeup. The "user_id" one really hit home for me. At Userify (ssh key management, blah blah) we currently have hotspots where some companies integrate the Userify shim and put 'change_me' (or similar) in their API ID fields. Apparently they don't always update it before putting it into production... so we get lots and lots of "Change Me" attempted logins! It's not just one company, but dozens.

Fortunately, we cache everything (including failures) with Redis, so the actual cost is minute/incremental at most, but if you are not caching failures as well as successes, this can result in unexpected and really hard to track down cost spikes. (disclaimer: AWS cert SA, AWS partner)

Segment's trick of detecting and recording when throttling occurs, and using that as a template for "bad keys" (which presumably are manually validated as well), seems like a great idea, but I'd suggest first caching even failed login calls if possible, as that probably would have mitigated the need to ever hit Dynamo.
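
A minimal sketch of what I mean by caching failures as well as successes. This uses a plain in-process dict with a TTL rather than Redis, and `LoginCache`, `backend`, and the 60-second TTL are all hypothetical names and values for illustration, not our production code.

```python
import time


class LoginCache:
    """TTL cache that stores failed lookups as well as successful ones."""

    def __init__(self, backend, ttl=60.0, clock=time.monotonic):
        self.backend = backend   # callable: api_id -> bool (the expensive lookup)
        self.ttl = ttl
        self.clock = clock       # injectable so expiry can be tested
        self._entries = {}       # api_id -> (expires_at, result)

    def check(self, api_id):
        now = self.clock()
        entry = self._entries.get(api_id)
        if entry is not None and entry[0] > now:
            return entry[1]                  # cache hit, success or failure
        result = self.backend(api_id)        # only reached on a miss or expiry
        self._entries[api_id] = (now + self.ttl, result)
        return result
```

With this in place, a thousand 'change_me' logins inside one TTL window cost a single backend lookup instead of a thousand.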

PS: the name 'project benjamin' for the cost-cutting efforts... pure genius.



Here's an amazing way to optimize your AWS bill: don't use it. Compared to dedicated servers, you overpaid by millions, then you spent non-trivial developer time and money to get the number down, and you are still overpaying. At 5K/month AWS might make sense (although that's debatable), but at your level of spend it's a really bad idea. At this level, an ops team of two (one on-site, one remote in a different time zone) would give you way more flexibility and a very low bill.



I wish I could invest in Segment. Lot of smart people over there.



These last two engineering blog posts[1] were great reads and they're doing good open source work. Not looking for a job but would love to meet the team if they host an event!

[1] https://segment.com/blog/ui-testing-with-nightmare/



Can you lower our bill now? :)



Reading through this, I would change "Dynamo is Amazon’s hosted version of Cassandra" to "DynamoDB is Amazon's hosted key-value store, similar to Cassandra". The former (to me) sounds like you're saying they vend a managed Cassandra.



Is this the same thing as segment.io?



Yes; they changed their name to Segment at the Series A.



Have you guys considered going bare metal or a hybrid approach? With such immense spending (even after saving the $1m/yr) it would probably be a lot cheaper.



It could be if our workload were relatively stable and there were spare engineering cycles to undertake a migration and all that it entails. Neither of these is the case.

Much of what allowed us to implement these savings quickly with a small team was the flexibility afforded by cloud infrastructure. Poor decisions are easy to reverse, but in a bare metal world you better be damn sure what you're doing, which slows down the decision-making process and seriously complicates experimentation. The number of people who know how to build out datacenters at the scale of thousands of machines is vanishingly small.

We'd also need to replace IaaS services like ECS, ELB, ElastiCache, RDS, and DynamoDB. There are certainly off-the-shelf replacements, but we'd need to build out the expertise within our teams to operate these systems. We're talking roughly a dozen or so engineers working full time for many, many months to get these systems in place from scratch, on top of the even larger effort to design and build out datacenters. I'd much rather plow those cycles into efforts like expanding to multiple regions and improving reliability of internal services. That's a much better return on investment for our customers.

Right now we're in the sweet spot for the cloud. We're way too big to run on trivial amounts of hardware, growing at a rate that makes it difficult to stay ahead of demand in a datacenter-centric world, and too small to justify investing in a scalable hardware build-out.



I was curious about that myself. What's the best way to model the costs associated with the maintenance overhead that comes with bare metal, versus the savings from managing bare-metal servers through a service (like Packet), co-locating your own hardware, or running your own datacenter?

My gut feeling is that you have to get to an extraordinary size to realize any meaningful savings, but that's primarily based on Dropbox's migration off AWS (https://www.wired.com/2016/03/epic-story-dropboxs-exodus-ama...).



That's a big giant "it depends."

It depends on what your service looks like: CPU intensive, Memory Intensive, Storage intensive? (In reality some unique mix).

You probably won't see huge savings in year one, as you'll be spinning up a lot of new things and have a fairly large CapEx outlay. If your growth pattern is steady and predictable, though, you should be able to plan out your hardware buys or use a hybrid solution to handle traffic bursts.

One of the nice things about running your own hardware is that some costs are easier to control. Don't need new hardware? Then you don't need to spend on it.

You also have much more control over your environment, so you are able to really optimize your code and infrastructure so that you don't need to scale as large system-wise.

But back to the question of how to model it: you just gotta dig in, make some educated guesses about performance, test, and repeat.
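
If it helps, the crudest possible first pass at that model is a break-even calculation: an up-front CapEx plus a monthly run cost on the bare-metal side, against a flat monthly cloud bill. Everything here is an assumption for illustration (the function name, the flat-cost simplification, and especially the dollar figures, which are placeholders and not real quotes).

```python
def break_even_month(cloud_monthly, capex, metal_monthly, horizon_months=60):
    """Return the first month at which cumulative bare-metal cost is no
    higher than cumulative cloud cost, or None if that never happens
    within the horizon."""
    for month in range(1, horizon_months + 1):
        cloud_total = cloud_monthly * month
        metal_total = capex + metal_monthly * month
        if metal_total <= cloud_total:
            return month
    return None


# Placeholder inputs: 100k/month cloud bill, 600k up-front hardware,
# 50k/month to run it (colo, power, staff).
months = break_even_month(100_000, 600_000, 50_000)
```

With those placeholder inputs the break-even lands at month 12, which lines up with not seeing much savings in year one. A real model would add growth, hardware refresh cycles, and the engineering time for the migration itself.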





