

Google Infrastructure - brlewis
http://piaw.blogspot.com/2010/04/infrastructure.html

======
strlen
"Build only the tools you need as the need arises."

This is certainly true, although some form of generic infrastructure is still
universally needed: most importantly operations, build, test, and deploy tools.
These tools enable much faster iteration. Much of the foundation for this
infrastructure is now available as open source, which means it's not difficult
to customize to a start-up's needs (no need to build from scratch): git for
SCM, puppet/chef/bcfg2 for configuration management, nagios for alerting,
capistrano for deployment, hudson for CI, etc...
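To illustrate how small the first pass at one of these tools can be, here is a hypothetical minimal HTTP health check in the spirit of alerting tools like nagios. The function name, URL handling, and OK/CRITICAL classification are invented for illustration; this is not nagios's actual plugin API:

```python
import urllib.request

def check_http(url, timeout=5):
    """Probe an endpoint; return 'OK' for a 2xx response, else 'CRITICAL'.

    Connection failures, timeouts, and non-2xx statuses all count as
    CRITICAL -- the coarse pass/fail signal an alerting tool pages on.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return "OK" if 200 <= resp.status < 300 else "CRITICAL"
    except Exception:
        return "CRITICAL"
```

A real deployment would layer scheduling, escalation, and notification on top, which is exactly what the off-the-shelf tools provide.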

There's also a corollary to this statement: when you _need_ infrastructure,
you shouldn't shy away from building it. First-class, productivity-increasing
tools and infrastructure distinguish great software companies from average
ones. Building them also requires strong engineering teams and is often
indicative of such (a great heuristic when deciding whether to join a
start-up: have they done anything beyond reading from a database and
outputting to a screen?).

I've seen companies whose solution to infrastructure woes has been to shell
out for expensive commercial solutions (or very ill-fitting open source ones)
and then _force_ development teams to use them ("you must build this using
Oracle SOA, oh and convert the build system to maven while you're at it; we
know it sucks, but we've made this choice"), even when these solutions reduced
developer and operations productivity.

On the other hand, I've also seen companies develop their own infrastructure
when open source solutions already met their needs very well, _again_ reducing
developer productivity and forcing these systems on engineering teams due to
the sunk cost fallacy ("yes, we know there's Protocol Buffers and Thrift, but
you _must_ use our custom binary XML parser!").

The reason Google, Amazon, and Yahoo succeeded with small teams and stayed
alive during a _major_ cash crunch (the dot-com crash) is partly their
ability to make effective "build vs. buy" decisions when it came to
infrastructure (e.g., using an off-the-shelf RDBMS for "back office"
applications like billing and order management, while building their own
systems for low-latency, high-throughput, highly scalable applications).
Companies that relied on shelling out for rows of expensive load balancers,
Sun E450/E10k servers, and Oracle RDBMS licenses went under.

------
EricBurnett
The FriendFeed team all came from Google, however, so who knows what
infrastructure they built up right away. I think an example company a little
further from Google would serve this post better.

~~~
strlen
They've also built Tornado, which is exactly the sort of infrastructure most
companies would shy away from building. Tornado gave them the ability to
quickly build "real-time" web applications, which is what FriendFeed was
about. It's not a good example of "not building generic infrastructure"
(FriendFeed is far from the only place doing "Comet"). The rule should be to
build infrastructure when it's advantageous to do so (not earlier or later)
and when it does more (or does something specific better) than an existing
open source project.
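For context, "Comet" means holding an HTTP request open until the server has new data to push, and Tornado's non-blocking I/O made that cheap. Here is a minimal sketch of the long-polling pattern using only Python's asyncio primitives; the class and method names are invented for illustration and are not Tornado's actual API:

```python
import asyncio

class MessageHub:
    """Holds messages and lets pollers block until new ones arrive."""

    def __init__(self):
        self.messages = []
        self.event = asyncio.Event()

    async def wait_for_messages(self, cursor):
        # Suspend (without tying up a thread) until a message newer than
        # `cursor` exists -- this is the "held-open request" in Comet.
        while len(self.messages) <= cursor:
            self.event.clear()
            await self.event.wait()
        return self.messages[cursor:]

    def publish(self, message):
        self.messages.append(message)
        self.event.set()  # wake every waiting poller

async def demo():
    hub = MessageHub()
    # The "client" parks on the hub with no data available yet.
    waiter = asyncio.create_task(hub.wait_for_messages(cursor=0))
    await asyncio.sleep(0)          # let the client start waiting
    hub.publish("hello, realtime")  # new data unblocks the poller
    return await waiter

print(asyncio.run(demo()))  # prints ['hello, realtime']
```

With thread-per-connection servers each parked client costs a thread; an event loop like Tornado's (or asyncio's, sketched above) makes thousands of held-open requests affordable, which is what made FriendFeed-style real-time updates practical.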

A corollary should be: generic infrastructure, when built by start-ups, should
be open source unless it brings a distinct competitive advantage.

------
rphlx
An absurd recommendation. Bandwidth and hardware costs are rapidly declining,
and are already low relative to developer time. Unlike five years ago, there's
very good free software addressing common high-scale problems today.

Do you need to spend weeks writing your own NoSQL backend to succeed? I doubt
it. Try MongoDB or something first, and don't be so cheap that it's running
with no margin, pissing off early adopters who really wanted to love you.

~~~
strlen
> An absurd recommendation.

What did you see as his recommendation? I think he suggested _not_ building
generic infrastructure unless needed.

That being said, I definitely agree with you that you _should_ run with margin
and some fault tolerance. Not because you need four-nines availability, but
because you're wasting operations' time otherwise. I fell into that trap with
my first attempt at a "start-up": running application hosting (before it was
called cloud computing) on minimal hardware that I'd built and physically
hosted myself. As a result I was spending more time on operations than on
writing code (including code to automate operations). Shelling out, at least
for additional machines and an IP KVM (not to mention "remote hands" service
at a colo, or a managed hosting provider), would have made this much less
time-consuming.

