
Infrastructure Mistakes Companies Should Avoid - takinola
http://firstround.com/review/the-three-infrastructure-mistakes-your-company-must-not-make/
======
calcsam
There are basically three tiers of startup infrastructure needs, depending on
your business model. Your decision process should be driven by which one you
are operating in.

(1) Consumer ad-supported. Pinterest, Instagram, Buzzfeed, etc.

Your CPM is going to be pretty low, so you probably want to run your own infra
-- it's all gonna come down to the margins. Dropbox notably transitioned to
running their own infra after years on AWS.

(2) Freemium software with heavy data ingestion needs, eg enterprise messaging
or CRM. Slack, Streak, etc.

You have pretty high value per customer, but you still have a ton of data
streaming into your system all the time. Probably use a public cloud
provider, but monitor your bill somewhat carefully lest it get out of control.

(3) Typical B2B workflow SaaS, or very high CPM consumer site. Airbnb,
Zenefits, Gusto, etc.

You store a relatively low amount of data, on the order of megabytes per
customer if not less. Use public cloud infra and make it widely available.
Eliminate "how much will this increase our AWS bill" from discussions about eg
event sourcing, proposed ML experiments, etc.
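The margin logic behind these tiers is easy to put in code. A toy heuristic in Python (the thresholds and all dollar figures are hypothetical, chosen only to illustrate the three buckets):

```python
def infra_recommendation(revenue_per_user: float, infra_cost_per_user: float) -> str:
    """Toy heuristic mirroring the three tiers: the thinner the margin
    between per-user revenue and per-user infrastructure cost, the more
    it pays to run your own hardware."""
    if infra_cost_per_user <= 0:
        raise ValueError("infra cost must be positive")
    ratio = revenue_per_user / infra_cost_per_user
    if ratio < 3:        # ad-supported: infra eats the margin -> own infra
        return "own infrastructure"
    elif ratio < 20:     # freemium + heavy ingestion: cloud, but watch the bill
        return "public cloud, monitor spend"
    else:                # B2B SaaS: infra cost is noise
        return "public cloud, don't sweat the bill"

# Hypothetical ad-supported site: $0.002/user/mo revenue on $0.001 infra
print(infra_recommendation(0.002, 0.001))   # own infrastructure
# Hypothetical B2B SaaS: $100/user/mo on $0.50 infra
print(infra_recommendation(100, 0.50))      # public cloud, don't sweat the bill
```

The cutoffs are invented; the point is that the decision is a ratio, not an absolute spend.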

~~~
reitzensteinm
At my day job, we receive real money for processing each transaction, which
adds maybe 1kb of information to the database. This makes the scaling story
laughably easy. By the time we're maxing out the biggest database you can buy,
it's IPO time.
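The arithmetic behind that scaling story is worth sketching. A quick check (the 1 KB per transaction is from the comment; the disk size and transaction rate are hypothetical):

```python
KB = 1024
TB = 1024 ** 4

def years_to_fill(disk_bytes: float, bytes_per_txn: float, txns_per_second: float) -> float:
    """How many years of sustained transactions it takes to fill a disk."""
    seconds = disk_bytes / (bytes_per_txn * txns_per_second)
    return seconds / (365 * 24 * 3600)

# 1 KB per transaction; assume a 10 TB volume and a healthy 100 paid
# transactions per second, sustained around the clock.
print(round(years_to_fill(10 * TB, 1 * KB, 100), 1))   # 3.4 (years)
```

Even at an implausibly steady 100 paid transactions a second, filling one large volume takes years: at that revenue-per-byte, infrastructure is never the bottleneck.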

For most of my life I worked in games, which is almost the exact opposite
problem. Tiny CPMs, insane traffic. Multiple providers of cheap bandwidth have
gone bankrupt and left me high and dry; but the savings were worth it, and I
continued to chase those deals. It was the right call.

Engineers love to design architecture that (they believe/hope/pray) will scale
to Uber sizes. But if you're not having a conversation with the business as to
whether that's a sensible goal, it's negligence bordering on fraud. You aren't
being paid to needlessly teach yourself cool new technology to solve imaginary
problems.

Which is all to say: ignore parent's advice at your peril.

~~~
davedx
> it's negligence bordering on fraud

That's the first time I've heard designing scalable software called
fraudulent. I'm not sure it means what you think it does. ;)

~~~
adamzochowski
Parent spoke about 'sensible goal' that is needed by business. If engineer
delivers something that is not sensible but overengineered, then maybe it
could be called fraudulent.

Business needs wheelbarrow, receives a tesla instead.

------
rdtsc
Good points.

On cloud lock-in, it is important to keep in mind that the companies offering
it want to lock you in. The article mentions this, but I think it needs to be
emphasized. They are not passive agents but more like really sophisticated
drug dealers who study addiction and know how to profit from it. "Hey, psst,
look, you get a 6 month free trial, just give this a try, it costs nothing...".

But cloud providers started to compete with each other harder. As part of that
many are lowering costs and open sourcing cloud orchestration tools which 10
years ago was the super secret sauce. Article covers this a bit as well and I
noticed it too. Running your own cloud on bare metal is becoming more viable.
AWS might be good today, but Google wants your business as well, and
Kubernetes and some bare metal provider might save serious money in the
future.

As for HN driven development, yeah I have seen a couple projects ruined
switching from Python to Go (during last few years there was a story like that
every week on HN). It wasn't that Go was bad, it is just that it destabilized
an existing product without delivering enough benefits.

> One key emerging type of tool Freedman advises looking into implementing is
> a ‘distributed tracing’ system, often modeled after Google’s “Dapper”
> system.

The secret sauce for me is using Erlang (Elixir works as well). Sometimes it
feels like cheating, as in "this shouldn't be that easy": distributed tracing,
hotpatching to add a log statement while everything is up and running, or
restarting small parts of the service. Imagine, say, C++ being able to
confidently run gdb on a process, kill a thread, reload new code, and let the
thread restart without fear of causing some memory corruption or leaving a lock
acquired. Like the article said, you can do that with many tools, but having
it be solid, and built in, is a huge advantage. Money-wise it just means needing
fewer people and less ops pain. Because if there is one thing that's right up
there with infrastructure costs, it's people's time.
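For contrast, the closest most mainstream runtimes get to this is module reloading. A minimal Python sketch of the idea (a pale imitation: unlike the BEAM, nothing here protects in-flight state or held locks across the swap; the module name is made up):

```python
import importlib
import pathlib
import sys
import tempfile

# Simulate "patching a running system": write a module, load it, then
# change the source on disk and reload without restarting the process.
tmp = pathlib.Path(tempfile.mkdtemp())
(tmp / "hotmod.py").write_text("def handler():\n    return 'v1'\n")

sys.path.insert(0, str(tmp))
import hotmod
print(hotmod.handler())          # v1

# Ship a fix (e.g. add a log statement) while the process keeps running.
(tmp / "hotmod.py").write_text(
    "def handler():\n    print('patched!')\n    return 'v2'\n"
)
importlib.reload(hotmod)
print(hotmod.handler())          # patched! then v2 -- but any state or
                                 # locks held by old code are our problem,
                                 # not the runtime's
```

Erlang's code server does this per-module with old/new code versions coexisting, which is what makes it safe enough to use in production.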

~~~
TeMPOraL
> _On cloud lock-in, it is important to keep in mind that the companies
> offering it want to lock you in. The article mentions this, but I think it
> needs to be emphasized._

It's funny that companies trying to lock people into their clouds have to be
warned about being locked into a cloud themselves by their service providers.
To follow your example, it's like a drug dealer getting addicted by another
drug dealer. They should have known better...

------
xarope
I didn't find it so offensive (adblock), and the advice actually seems pretty
spot-on with my experience.

1) yes it's easy to prototype in cloud, and it's also easy to fall into the
trap of vendor lock-in. Instead, if you are based in the USA (which sadly I'm
not), check on ebay, there's plenty of refurbished or liquidated equipment at
a fraction of brand-new pricing.

BTW, the latter should also be an indicator that not all is fluffy in cloud-
land...!

2) 100% agree on this. Fintech is rearing its ugly head, with 100's (1000's?)
of startups all trying to get a piece of the consumer pie.

I subscribe to the GNU/KISS philosophy. Keep it simple, keep it as a set of
known tools which speak a common "API" (whether that's just plain and simple
text, XML, JSON etc), train my guys to understand and use them, and you will
achieve far more productivity than jumping every few months to yet-another-
you-beaut toolset guaranteed to solve your CI problems (until after 6 months,
you find out their business strategy is to get acquired, at which point in
time you spend another 6 months adopting another tool... and another...)

3) If anybody has ever seen the power of dtrace, or even what a
straightforward systems/network monitoring system can capture (e.g. Zabbix),
then they would definitely agree that monitoring is key to ensuring the health
of the system(s). Once you get past a whole bunch of scripted alerts on one
server, and wonder how to scale it, then bump into Nagios/Zabbix etc., you
will kick yourself for not having done so sooner!
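That "scripted alerts on one server" stage typically looks something like this (the thresholds are hypothetical); it works fine right up until the server count and check count multiply:

```python
import shutil

def check_disk(path: str = "/", warn_pct: float = 80.0, crit_pct: float = 90.0) -> str:
    """Classic one-off monitoring script: alert when a filesystem fills up."""
    usage = shutil.disk_usage(path)
    used_pct = 100.0 * usage.used / usage.total
    if used_pct >= crit_pct:
        return f"CRITICAL: {path} at {used_pct:.1f}%"
    if used_pct >= warn_pct:
        return f"WARNING: {path} at {used_pct:.1f}%"
    return f"OK: {path} at {used_pct:.1f}%"

print(check_disk("/"))
# Cron this, email the output, repeat per metric and per host... and soon
# you are maintaining a homegrown Zabbix, minus the dashboards.
```

Dedicated monitoring systems exist precisely because this pattern doesn't scale past a handful of checks.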

------
mtalantikite
Please don't roll your own hybrid cloud or colo if you don't know what you're
doing, particularly from the start. It likely is a distraction from your core
product and, as the article states, can easily tie up 3 or so solid engineers.

The takeaway really is that you should be aware of the trade-offs and lock-ins
you're signing up for, as with anything.

~~~
taneq
Unless you're doing something stupendously horsepower- or data-intensive,
don't use a cloud (roll-your-own or outsourced) at all. Sit a spare PC under a
desk somewhere and run your servers/services on that. Once you figure out what
you're actually doing and you know how your core technology scales, _then_ you
look at what's required to serve it to your target audience.

~~~
sokoloff
Or connectivity intensive. AWS (and other cloud providers) provide a network
that is difficult/impossible to compare to a "PC under a desk somewhere".

They're managing the daily deluge of DDoS attacks and you're paying for less
than 0.0001% of that because a million other customers are sharing the burden.

~~~
falcolas
Amazon absolutely can survive a DDoS attack. But can your wallet? AWS
published a white paper on how to survive a DDoS on AWS that amounted to
"outscale the attack." Doing that could very well be a business-ending
proposition right there.

Sure, your website never went down, but now you have an infrastructure bill
you'll have to do another round of funding just to pay off.

~~~
sokoloff
I was more thinking of the DNS amplification and UDP flood type of attacks
that are transparently handled by a cloud provider, but even for an
application attack, you still have a choice to let your site go down rather
than scale up infinitely. (You can cap the scaling.)
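Capping the scaling is exactly that trade: a bounded bill in exchange for availability. A toy model (instance counts and prices are hypothetical):

```python
def hourly_bill(demand_instances: int, max_instances: int,
                price_per_hour: float) -> tuple[float, bool]:
    """Return (cost, site_up). With a scaling cap, an application-layer
    flood can push demand past the cap, taking the site down but leaving
    the bill bounded at max_instances * price."""
    running = min(demand_instances, max_instances)
    site_up = demand_instances <= max_instances
    return running * price_per_hour, site_up

# Normal traffic: 10 instances wanted, cap of 50, $0.10/hr each
print(hourly_bill(10, 50, 0.10))    # (1.0, True)
# Under attack: 5000 instances wanted -- bill capped at $5/hr, site down
print(hourly_bill(5000, 50, 0.10))  # (5.0, False)
```

The cap turns "outscale the attack" from a blank check into an explicit availability/cost decision.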

~~~
avifreedman
They do manage those - though mainly for protecting themselves, not the
specific customer being attacked. In AWS, the people I've talked to recently
as well as historically say you'll get pretty uniformly rate-limited, vs.
actually doing per-/32 DDoS-mitigation-type limiting. Has your experience been
different (for volumetric attacks)?

~~~
sokoloff
Our experience has been that the "collateral damage to us" DDoS attacks
vanished entirely from the "set of things we think about" which was not at all
true in some of the colo's we were in.

In terms of application-specific attacks, we have used proxies in AWS to
mitigate attacks against our colocated servers from time to time. AWS handles
some of the volume and some of the types of attack traffic, and we scale and
cache to handle others. This was much cheaper and easier than some of the
Prolexic type solutions.

Agree that they aren't doing anything specific on a host or customer basis,
but just inherent in protecting all of their customers, some of the specific
problems also go away.

~~~
avifreedman
Absolutely agree that collateral damage vs many small-mid-sized hosting
providers is 0 in Amazon, though you do still have to deal with the normal
'noisy neighbor' problem by re-creating instances in a different neighborhood.

------
dexterdog
This seems way too biased against the cloud. It doesn't mention things like a
solid sales relationship with your cloud provider which can help you unearth
all kinds of breaks and incentives. I've been using AWS in dedicated and
hybrid modes since its inception. If you are hitting a pain point cost wise
they will work with you to try to keep you from leaving or migrating services
to on-site.

He also doesn't mention the huge benefit of cost drops that cloud providers
will give you that you will not see when you're on a 3-yr lease and a long-
term DC/bandwidth commit.

~~~
kuschku
Even those breaks and incentives can’t change a two-order-of-magnitude
difference in cost between containerized products and bare metal, or the
single-order-of-magnitude difference between virtualized and bare metal.

If Amazon gives you the same service for a tenth or a hundredth of the price,
sure, but that just doesn’t happen.

~~~
dexterdog
Can you please give an example where you can see those cost savings because I
will migrate to those services tomorrow.

~~~
kuschku
I’m comparing products like
[https://www.hetzner.de/us/hosting/produkte_rootserver/ex41](https://www.hetzner.de/us/hosting/produkte_rootserver/ex41)
with similar performance at DigitalOcean and AWS’ EC2 cloud to get those
numbers. The linked example is an Intel® Core™ i7-6700 with 32 GB DDR4 RAM,
2TB HDD storage (in a RAID, so raw capacity is 4TB, but usable is 2TB) and
1Gbit/s network connection (30TB traffic inclusive), for 40 bucks a month.

Compare with DO: The closest comparable product at DO runs at 320 bucks a
month, and you need to buy extra traffic (you get 23TB less traffic).

On EC2, the best comparable model would be the m3.2xlarge, for 280 bucks a
month. (plus another 100 bucks for the 24/7 phone support).

Now, let’s try getting the same performance with service-as-a-service things:
At heroku, 24/7 phone support alone is 1000 a month. Let’s assume a standard
workload for that machine, we’ll use half of the RAM for Postgresql, and about
512GB storage for the database. With Heroku, that adds 750$ / month.

If we take the cheapest solution – a single dyno of the most powerful type –
we end up with 1250$/month + support. If we use separate dynos for our
services, as many as the original example server could run, we get 17
Performance M dynos at 250$/month each, overall reaching 4250$.

And Google’s Firebase and Container Engine prices are at the same costs, same
with Amazon Lambda, etc. And Hetzner isn’t especially cheap – any dedicated
hoster will provide you their services at those prices. It’s just the nature
of virtualization and higher abstractions that they are expensive.

If you can handle servers failing in your infrastructure, go with the cheapest
possible option – like KimSufi, Online.net, scaleways, etc. If you want the
standard quality and price, use Hetzner, OVH, and all the other industry-
standard hosters.
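The ratios implied by the figures quoted above are easy to verify:

```python
# Monthly prices quoted in the comment (USD)
hetzner_ex41 = 40      # dedicated: i7-6700, 32 GB RAM, 2 TB usable storage
digitalocean = 320     # closest comparable droplet
ec2_m3_2xlarge = 280   # closest comparable EC2 instance
heroku_dynos = 4250    # 17 Performance M dynos at $250/mo

print(f"DO vs dedicated:     {digitalocean / hetzner_ex41:.0f}x")   # 8x
print(f"EC2 vs dedicated:    {ec2_m3_2xlarge / hetzner_ex41:.0f}x") # 7x
print(f"Heroku vs dedicated: {heroku_dynos / hetzner_ex41:.0f}x")   # 106x
```

Roughly one order of magnitude for virtualized, two for the PaaS option, matching the claim upthread.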

~~~
avifreedman
Agreed - LeaseWeb, OVH, Hetzner, and even SoftLayer if you call and negotiate
can all be great options and have been very stable for many folks for
dedicated servers. Generally I recommend that people not make long term
commits, as it gives more leverage if there are network hot spots or other
issues you need their help resolving.

------
dbg31415
> If you discovered the tool on Hacker News and it's less than 18 months old —
> 'Danger, Will Robinson!'

You shouldn't fear new things. Keep it simple, keep it smart, and embrace what
works for your team. Don't shun things because they're old, don't shun things
because they're new. Shun complexity.

How about some real-world examples? Slack, Google Docs, and ZenHub. All of
these added value right out of the gate.

I first read about Slack in February 2014 on Hacker News.

* We Don’t Sell Saddles Here – Medium || [https://medium.com/@stewart/we-dont-sell-saddles-here-4c5952...](https://medium.com/@stewart/we-dont-sell-saddles-here-4c59524d650d#.b9koamhnq)

Started playing with it, then started using it. It's helped my team move much
faster. Slack added value. No reason not to use it. Same for ZenHub, same for
Zapier, same for Docker, same for a bunch of other tools where I can say,
first-hand, that being an early-adopter paid off vs. using something "tried
and true."

Oh, and on that note, I freakin love Marker! (=

* Marker - Annotated Screenshots Sent to any Bug Tracking Tool || [https://getmarker.io/](https://getmarker.io/)

~~~
snom380
From the article, it doesn't seem like he's talking about tools like Slack. If
that goes down, maybe it affects your productivity, but probably not your
production systems.

What he warns against is betting too early on core technologies like service
discovery, deployment systems (Docker etc.), and database systems (MongoDB).
And from reading HN posts, it sure seems like there's no shortage of people
being burnt by that.

~~~
avifreedman
Yep, sorry if it didn't come through clearly enough.

I was talking about (without trying to pick on any particular
projects/vendors) infrastructure glue components - db,
deployment/orchestration, storage, discovery, ...

Maybe 'tool' was the wrong word, and 'component' would have been more clear.

~~~
jacques_chester
How do you feel about PaaSes like Cloud Foundry or OpenShift?

They are, to greater or lesser degrees, able to present a uniform platform to
developers across various backends (CF runs on OpenStack, AWS, Azure, GCP,
vSphere or raw hardware via RackHD).

So long as you deploy your own services using the same tooling (BOSH), it's
possible to hoist and relocate a lot more easily than relying directly on the
IaaS's services.

Disclosure: I work for Pivotal, the majority contributor of engineering to
Cloud Foundry and BOSH.

~~~
avifreedman
I think they can be pretty efficient, and if a handle is kept on what's
deployed on top, can be not that much overhead over the cost of the infra
(whether it's owned or cloud). But that keeping a handle on things is key -
starting w/o a DBA is nice, but if no one is tracking tables or how they're
used, things can get pretty expensive :)

Specifically re CF - have seen a few companies use CF to do multi-
infrastructure, but a lot of the companies we work with have 5-10 roles and
just run them via config mgmt to deploy, or now docker +/- k8s, and don't use
PaaS at all.

~~~
jacques_chester
Thanks for coming back, I was worried that a late reply would be overlooked.

> _But that keeping a handle on things is key - starting w/o a DBA is nice
> but if no one is tracking tables or how they're used things can get pretty
> expensive :)_

At Pivotal we ran into this problem in building service brokers that worked by
interacting with single, large, efficient shared services. Most services lack
the strength of isolation that you are getting for the apps themselves. So on
a shared database, queue, cache etc, the noisy neighbour can really begin to
hurt.

In our 2nd generation of these service brokers we changed our approach.
Previously asking the service broker for a service returned almost immediately
(create account/endpoint/schema/queue/bucket/whatever). Now we actually go and
provision an entirely new, isolated service instance. Luckily BOSH makes this
relatively easy to do.

Essentially we've recreated the journey that led to containers: realising that
while the efficiency of shared instances is nice, it's more important to be
able to enforce functional and non-functional isolation. So now the services
are on par with the apps in terms of their platform behaviour.

The outcome is the same: ops no longer have to heavily gatekeep against bad
developers, because _only_ those developers will be affected by their errors.
I have a very long analogy involving sharehouses that I will skip on this
occasion.

> _but a lot of the companies we work with have 5-10 roles and just run them
> via config mgmt to deploy or now docker +/- k8s, and don't use PaaS at
> all._

Yeah, the jump from nothing to all-the-things is a pretty big one for people
who are solving the partial problem they see directly in front of them.
Dynamic languages, NoSQL etc are all much more approachable than their
alternatives, because you can build in smaller steps.

We're working on it -- PCFDev and BOSH bootloader are two main prongs and
there's more to come. If you want to give any more feedback or kvetching,
please feel free to email me (jchester@pivotal.io) and I'll connect you to the
right people.

------
throwaway2016a
I can't say I agree with the first two points.

For point #1: cloud services are fairly competitively priced with each other,
and using the tools they provide will lock you into a vendor but also
drastically reduce cost. For example, we used to roll our own MySQL and
Postgres; now we use AWS RDS, and it has saved us so much money I can't
believe we didn't do it in the first place. Does that mean it will be more
work to switch off of AWS? Yes, but it was worth it for us.

For point #2: with that attitude we would have never adopted Docker. And
adopting it early put us well ahead of the game. Now almost everyone seems to
use Docker or something like it but if we waited for it to mature it would
have taken longer to get the rewards.

I completely agree with #3, though. Although back to #1, taking advantage of
cloud provider specific monitoring tools can save a lot of time and money.

Edit: someone is way too downvote happy. Or maybe I'm using downvotes wrong. I
use it for "you're an ass" not "I disagree with you"... I'd love to hear the
opinion of the person who downvoted me and why they feel that way.

~~~
kuschku
> drastically reduce cost.

How did you reduce the cost by switching to AWS? That’s basically impossible.

Usually renting dedicated boxes and running your own instances on them is the
cheapest solution.

I guess that might also be the reason for the downvotes: you're making such an
extraordinary claim that it seems like trolling.

~~~
throwaway2016a
Ahh, I see now where the confusion was.

I reread part #1 of the article, and I was arguing something slightly
different. My argument is that using something like AWS RDS has lower startup
cost than having a DBA manage a DB server manually, not that it is cheaper
than running on bare metal.

I guess what I am saying is running your DB manually on the cloud protects you
from vendor lock-in (Cloud Jail, as he calls it) but at the expense of greater
upfront costs. It is the worst of both worlds.

I think startups shouldn't worry about managing their database; it's a
distraction in the early/mid stages. If a cloud provider has a managed
version, the break-even point where building it out yourself is cheaper is
actually surprisingly far out, and if you haven't validated your idea yet, I
just don't think the ROI justifies it.

Replace database with other infrastructure pieces.
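That break-even argument can be made concrete with a small sketch (all dollar amounts are hypothetical):

```python
def breakeven_months(managed_monthly: float, selfrun_monthly: float,
                     selfrun_setup: float) -> float:
    """Months until self-managing a database becomes cheaper than a
    managed service, given a one-time setup/engineering cost."""
    premium = managed_monthly - selfrun_monthly
    if premium <= 0:
        return 0.0   # managed is never more expensive in this scenario
    return selfrun_setup / premium

# Hypothetical: managed DB at $600/mo vs $200/mo of instance cost
# self-run, with ~$20k of engineering time to set up HA, backups,
# monitoring, and upgrades properly.
print(breakeven_months(600, 200, 20_000))   # 50.0 months -- "surprisingly far out"
```

With numbers like these the crossover is years away, well past the point where most startups have either validated the idea or moved on.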

With that said, after re-reading I do agree with the author on the point that
you should consider your options and have a plan.

------
us0r
Popup free version: [http://archive.is/DQJHI](http://archive.is/DQJHI)

~~~
rmason
Am I the only one who couldn't get rid of the pop-ups in the bottom third of
the article? Every time I scrolled the same pop-up that I had just dismissed
popped up again.

It was a very interesting article that made numerous valid points, but I came
away thinking a lot less of a VC that couldn't successfully configure a blog.
Does someone there proofread the articles?

------
lacker
_If you can get away with it, start out running multi-cloud_

That seems like a pretty bad idea to me. You don't need to run multi-cloud at
the start. Especially if you just use basic services like EC2 and S3 or
something container-based, maybe a database service for something standard
like Postgres, you can avoid getting locked in for quite a while. Early on,
there's just so much to do, going "multi-cloud" is a waste of your limited
time and energy.

------
fooyc
People always forget that there is a world between the "cloud" and leasing
colocation space where you manage your own servers and routers.

Renting dedicated servers is what everyone did before AWS, and it's still the
most affordable hosting.

~~~
kuschku
Exactly. As I showed here [1], there’s almost an order of magnitude between
EC2 and dedicated servers, while EC2 provides no benefit (unless you have
highly variable load – but in that case, you can just use EC2 in addition to
your existing servers).

[1]
[https://news.ycombinator.com/item?id=12627864](https://news.ycombinator.com/item?id=12627864)

------
pasta
As for hosting costs and hip tools: I see a lot of cases where they go hand in
hand.

An example is Magento. This is maybe the hippest webshop with all the bells
and whistles you could ever need. But it's also slow as hell. The amount of
money some companies throw at it to make it fast is insane.

------
ChoHag
> First do no harm. Protect your user experience at all costs. Make their
> trust sacred.

Why isn't this first?

------
panic
The three mistakes:

1. spending hundreds of thousands of dollars per month on internet services
('They land themselves in Cloud Jail.')

2. choosing technology based on hype rather than maturity ('They get sucked
in by “hipster tools.”')

3. not understanding what your computers are actually doing ('They don’t
design for monitorability.')

~~~
samfisher83
> 1. spending hundreds of thousands of dollars per month on internet services

The problem with this is you don't know if your business will be popular or
not. If it isn't, and you spent money on well-thought-out infrastructure,
well, you wasted time and money. If your product is successful, then you can
buy infrastructure later. Also, from a business perspective, you don't want
PP&E on your balance sheet. That is why you will see so many creative leasing
schemes, which the FASB cracked down on.

~~~
beambot
Which is why... (paraphrased) Freedman advises startups to watch the following
indicators as a measure of whether they may be approaching the danger zone:

- When always-on / constantly-growing workloads cross the $100,000/mo. mark,
you may hit the danger zone sooner than you think.

- Keep the number of lock-in services in check.

- Monitor for performance and look for cases where someone else's cloud
starts to cause issues.

This definitely matches infrastructure progression I've seen too!
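That first indicator compounds faster than intuition suggests; a quick projection (the growth rate is hypothetical):

```python
def months_until(spend_now: float, monthly_growth: float,
                 threshold: float = 100_000) -> int:
    """Months of compounding growth until cloud spend crosses `threshold`."""
    months = 0
    while spend_now < threshold:
        spend_now *= 1 + monthly_growth
        months += 1
    return months

# $10k/mo today, growing 10% month-over-month (hypothetical but not
# unusual for a startup): the danger zone is ~2 years away, not 10.
print(months_until(10_000, 0.10))   # 25
```

A team spending $10k/mo rarely plans for the $100k/mo conversation, yet at steady growth it arrives in about two years.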

------
hoodoof
The most invasive popup I have ever encountered.

Also the three points made in the article are, shall we say, dubious advice. I
wanted to use more extreme language but I think it's frowned on at HN.

Here's a sample: "If you discovered the tool on Hacker News and it's less than
18 months old — 'Danger, Will Robinson!'"

Ugh - that's what you get for reading advice from venture capitalists.

Don't read this post.

~~~
csydas
Try it on mobile. Doubly worse on an iPhone 4s screen when you have the top
and bottom fifths covered with an ad.

I also agree the warnings were apt but the accompanying advice didn't really
address the issue well. I deal with other people's infrastructure issues every
day, startup and old business alike, and rapid adoption isn't the problem the
article makes it out to be. My experience is that you can barely get most
people to apply a simple critical update to a storage device much less major
infrastructure changes without the person getting an official guarantee from
every vendor in their infrastructure that the change won't disrupt their
workflow.

And the advice on cloud jail just seems premature for most start ups. It's
talking about the owners answering to boards when it seems unlikely to me that
most people needing this advice would even be so far along as to have a board
to answer to. The advice of "don't put all your eggs in one basket" is great,
but that second basket costs money. If anything, I'd imagine a board concerned
about costs would want to consolidate costs not spend more. Boards tend to
make irrationally fiscally conservative choices.

~~~
avifreedman
I agree it's not something people are likely to hit int heir first year, but I
do think some sensitivity to trade-offs at the beginning is good.

Re: board dynamic - the interview was about things to watch for for people
thinking they might get into decent growth. At those stages, especially in
2016, many boards are watching gross margin and unit economics. But ours has
been very supportive from day 1 of having SaaS offerings be HA and DR.

------
dsmithatx
I started reading this article and it has some good points. Then came the
giant ad when you barely scroll. It's amazing that an article that seems well
written is obliterated and made unreadable.

A big point I find myself trying to convey to developers who start devops was
well summarized.

When it comes to infrastructure components, keep it as simple as possible.
(And have a healthy amount of skepticism.) “When it comes to your
infrastructure, especially the core components that glue everything together —
storage, load balancing, service discovery — you really need to be using
things that are not, themselves, going to cause problems. You probably have
enough problems with the rest of your application and components.”

I wish I could have finished and shared the article but, sadly the ad as I
scroll down made it unreadable.

------
fatbird
I'd bet lots of money that this is exactly the sort of PR puff piece Paul
Graham describes [0], where a PR firm writes a pseudo-advice column that drops
Kentik's and Avi Freedman's names a lot. Bland, generic infrastructure advice
phrased in an insidery tone that leaves lots of Google trail for the company
that paid for this sponsored advertising. Don't waste your time, it's nothing
you can't figure out on your own.

[0]
[http://paulgraham.com/submarine.html](http://paulgraham.com/submarine.html)

~~~
avifreedman
As to the specific advice, sorry you didn't think it was valuable.

But re: it being a PR fluff piece -

Nope... Not in this case, and I would bet against that being the case with
First Round Review in general.

I'd bet a reasonable sum that every article in the First Round Review was
spoken and/or written by the person being interviewed/quoted.

For this article, Camille prepped me with some questions, then we spent an
hour on the phone, and she got me a draft that I made suggestions on
(especially the monitoring section, which was much weaker originally).

------
wtbob
There's some really, _really_ good stuff here — well worth the read.

------
cbau
Clickbait title.

~~~
CaptSpify
How is it clickbait? He lays out 3 pretty clear mistakes, and ways to avoid
them.

