
What powers Etsy - johnzimmerman
http://codeascraft.etsy.com/2012/08/31/what-hardware-powers-etsy-com/
======
druiid
It's very cool to see these kinds of posts. I always learn something from them
and it's always neat to see what others have built to deal with a specific
problem.

I wonder in this case whether there's any specific reason Etsy is not
utilizing virtualization of one kind or another? Basically they're building
dumb boxes, when it seems a much better fit to have a good number of slightly
beefier hosts running VMs and just rebuild bad instances from images (this
would be especially good since they're already utilizing Chef).

I know that as an e-commerce firm this has helped us substantially to move
toward this kind of a structure.

~~~
Dobbs
The rebuilding of bad instances you mention is just as easy to do with PXE
and Chef/Puppet on bare metal.
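
The moving parts are small, too. A minimal sketch of a pxelinux config that
netboots a box into a kickstart install (hypothetical paths and install
server; the kickstart's %post can then bootstrap Chef):

    # /var/lib/tftpboot/pxelinux.cfg/default
    DEFAULT rebuild
    PROMPT 0
    LABEL rebuild
      KERNEL vmlinuz
      APPEND initrd=initrd.img ks=http://installserver/kickstart.cfg

Reboot the bad box with PXE first in its boot order and it reinstalls itself
from scratch.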

~~~
e12e
"Just as easy" might be an overstatement. Equally doable might be closer to
the truth.

One example where virtualization is easier in practice is where you're not in
control of the network infrastructure -- i.e. you subcontract management of
switches etc. to someone else.

For someone the size of Etsy that wouldn't matter much -- they have an obvious
need to manage the entire stack. But for smaller institutions that might not
target "scaling up to serve the entire public web" -- it is a factor.

Another example is if you have heterogeneous hardware -- it is much easier to
just install a hypervisor that runs on every box and then manage that via
your VM toolset than to set up many different images. It also makes it easier
if you need to run different OSes -- do all your OS images have drivers for
all your hardware?

Granted, in theory OSes should abstract away hardware -- in practice, if you
need to swap out a few of your MySQL instances running GNU/Linux for
Microsoft SQL Server for a new project or a change in workload, being able to
just spin down the old VMs and spin up the new ones helps a lot in terms of
agility.
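
With libvirt, for example, that swap is a handful of commands (a sketch with
hypothetical domain names, assuming the new guest's disk image and XML are
already prepared):

    virsh shutdown mysql-01      # gracefully stop the old GNU/Linux guest
    virsh undefine mysql-01      # drop its domain definition
    virsh define mssql-01.xml    # register a guest built from a Windows image
    virsh start mssql-01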

But yes, it's absolutely possible to provision physical hardware as well.

------
lordlarm
The Supermicro servers are really, really great - one of the best investments
I've made so far in 2012.

I bought mine (a Supermicro 2U dual 3GHz Xeon, 4GB, 12-bay storage server)
for $400 on eBay, with great service and terms. Recommended if you are ever
in need of (or just want) a private rackmount server (1U, 2U or more).

~~~
druiid
I'm a big fan of Supermicro systems. They are probably one of the best
bang-for-the-buck options out there, although one has to purchase them with
the understanding of the potential 'risk' of not having a giant support
department behind them like Dell/HP have. That's not a knock against
Supermicro, just a note for the PHB-type view.

------
mbell
I noticed something like this several times in the article: "Each machine has
2x 1gbit ethernet ports, in this case we’re only using one."

It sounds like a single switch failure could take down a large portion of
their infrastructure. The seeming lack of geographic (or even
datacenter-level) redundancy also seems a little dangerous at this scale.

~~~
incision
Yeah, that strikes me as a bit odd too.

I've seen situations where folks have done this because they believed they
would need VSS/vPC or similar on the switch side, but ALB on the server side
should avoid that.

Perhaps they have an array of access switches, each serving some subset, to
lessen the impact?

~~~
brooksbp
VSS/vPC == MLAG... but what is ALB? I am guessing the server uses one link
until failure, then switches to the other? Even then, there are still network
issues to overcome that may justify MLAG.

~~~
avleenvig
Honestly, we find it's much easier to spread the load between many switches,
and have enough capacity that one switch failure is a non-event. Switches are
expensive, it's true. After a certain point it becomes less expensive and
less complex to just add more servers and switches. There's a strong
advantage to keeping things as simple as possible. Bonding isn't really
complicated, but how many not-complicated things can you add before things
become complicated? :-)
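
(For reference: the ALB mentioned above is presumably the Linux bonding
driver's adaptive load balancing mode, balance-alb, which needs no
switch-side configuration. A minimal RHEL-style sketch, with hypothetical
addresses:)

    # /etc/sysconfig/network-scripts/ifcfg-bond0
    DEVICE=bond0
    IPADDR=10.0.0.10        # hypothetical address
    NETMASK=255.255.255.0
    BOOTPROTO=none
    ONBOOT=yes
    BONDING_OPTS="mode=balance-alb miimon=100"

    # /etc/sysconfig/network-scripts/ifcfg-eth0 (and likewise for eth1)
    DEVICE=eth0
    MASTER=bond0
    SLAVE=yes
    BOOTPROTO=none
    ONBOOT=yes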

------
lazyjones
We use those 4-in-1 Supermicro systems too, but with AMD CPUs. They seemed a
much better deal to us and have a few more DIMM slots (= 128GB per node in
our ~2-3 year old boxes). The only issue was the almost-half-length PCIe
slot: it's actually 6" instead of 6.6", so you have to be careful when buying
PCIe cards (this might not be an issue with the Intel-based board layouts).

We also invested in 10GbE infrastructure rather early (we're smaller than
Etsy traffic-wise) and found it very useful for avoiding congestion issues
when we need to transfer large files or pull data from a busy Postgres
server.

As for HP servers, we avoid them, since our DL360 G6 reported obscure fan
failures when booting with an Intel 10GbE card and we had to buy the (slower)
HP NetXen-based cards instead. Supermicro boxes are not that picky, and you
don't get the feeling that the vendor might be using dirty tricks to lock you
in.

------
donavanm
"Nagios checks to let us know when the filesystem becomes un-writeable" Any
specific reason you don't just panic? I've had RO fs edge cases bite me a few
times.
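
(For context, a check like that is presumably just a write probe. A minimal
sketch of a Nagios-style plugin -- hypothetical mount point, standard Nagios
exit codes -- might look like:)

    #!/usr/bin/env python
    # Hypothetical Nagios-style plugin: verify a filesystem is writeable.
    # Standard Nagios exit codes: 0 = OK, 2 = CRITICAL.
    import os
    import sys
    import tempfile

    MOUNTPOINT = "/var/lib/mysql"  # hypothetical path to probe

    try:
        # Creating and removing a temp file fails with EROFS on a read-only fs.
        fd, path = tempfile.mkstemp(dir=MOUNTPOINT)
        os.close(fd)
        os.remove(path)
    except OSError as e:
        print("CRITICAL: %s is not writeable: %s" % (MOUNTPOINT, e))
        sys.exit(2)

    print("OK: %s is writeable" % MOUNTPOINT)
    sys.exit(0)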

"In the future we may even run fiber to these machines" Anything against 10g
copper? Per port prices are getting crazy cheap.

"DNS servers, LDAP servers, and our Hadoop Namenodes ... need RAID for extra
data safety" Anything you've had issues with? Host level redundancy with the
first two has been pretty great IME, can't comment on name odes. The only
issue I've run in to is poor LDAP clients not fast failing an unhealthy server
when another is available.

~~~
avleenvig
RO filesystems can be bad, but usually they're soft failures for us:

* Memcache can still work just fine
* DB servers stop responding (and the app handles that fairly gracefully)
* Web servers serve files from a RAM disk, so they keep working

No reason against 10G copper specifically - we haven't had to address the
problem in detail yet. When we do, we might choose copper. Depends what
happens when it happens :-)

We had a very nasty incident a few months ago, where the drive in one of the
LDAP servers died. Well, it sort-of died. It started to time out a lot but
didn't go fully offline. openldap kept running, but when you connected to it,
the TCP connection would open and hang. This meant that all of our servers saw
the server as "OK", but LDAP stopped working and caused all kinds of brilliant
havoc :-)
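
(That failure mode is exactly what a connect-only check misses. A deeper
check completes a real bind within a timeout -- a minimal sketch using
python-ldap, with a hypothetical hostname:)

    #!/usr/bin/env python
    # Sketch: health-check LDAP at the protocol level, not just TCP connect.
    # A server that accepts connections but hangs will fail the bind timeout.
    import sys
    import ldap

    URI = "ldap://ldap01.example.com"  # hypothetical host
    TIMEOUT = 5  # seconds

    try:
        conn = ldap.initialize(URI)
        conn.set_option(ldap.OPT_NETWORK_TIMEOUT, TIMEOUT)  # connect timeout
        conn.set_option(ldap.OPT_TIMEOUT, TIMEOUT)          # operation timeout
        conn.simple_bind_s()  # anonymous bind; a hang raises ldap.TIMEOUT
    except ldap.LDAPError as e:
        print("CRITICAL: LDAP bind to %s failed: %s" % (URI, e))
        sys.exit(2)

    print("OK: LDAP bind to %s succeeded" % URI)
    sys.exit(0)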

~~~
donavanm
Interesting. I was thinking failure to persist updates, inconsistent health
check results, those types of nasty corner cases.

Sure. Premature optimization and all.

Huh. SSD from a major chip vendor, perhaps? I had one of those bite me just a
few weeks ago. In my case it would have been fine if the drive had offlined
itself, but the intermittent IO pauses killed performance while still passing
health checks. Actually, it's the same type of problem RO filesystems make me
scared of.

------
chrismealy
Anybody know what they use hadoop for?

~~~
avleenvig
Tons of stuff! Stay tuned for that in a future post :-) We posted about this
in 2010:
http://codeascraft.etsy.com/2010/02/24/analyzing-etsys-data-with-hadoop-and-cascading/

Probably time for a refresh!

\- Avleen Vig, Staff Operations Engineer, Etsy

~~~
chrismealy
Cool! Thanks.

------
EliRivers
It's powered by love of crafting stuff :p

~~~
avleenvig
If I could hand-craft circuit boards and CPUs, I probably would :-)

