
Why we built a 40TB photo server in-house instead of using S3 - benhoyt
http://tech.oyster.com/how-to-build-a-40tb-file-server/?
======
ivan78
"The one most valuable asset at Oyster.com is our photo collection. ... In
strict accordance with KISS methodology, we opted against LTO and S3, and
decided to build a big BOX."

I can only imagine how scared they will be each time they need to install
updates or reboot THE BOX. They will eventually decide to build an identical
BOX and mirror their data on a daily basis. Then they will notice that
mirroring such big volumes of data wastes way too many system resources and
start evaluating in-house distributed storage solutions, such as OpenStack
Swift. Then they will notice it is way too overcomplicated and finally decide
to migrate their data to Amazon S3.

I'm writing this as a person who has walked the same path over the last few years.
:-)

------
mbell
This is honestly pretty scary. There are a lot of single points of failure in
this solution.

1) Single Box

2) Single Location

3) Single 40TB RAID 6 array on a single RAID card with 22 drives (assuming 24
2TB drives, 2 parity, 2 hot spares = 40TB)

4) Single bonded network link means single switch, no redundancy against
switch / network device failure

Honestly this may have been cheap, but you're getting what you're paying for:
an unreliable backup solution.

Also: Norco? Really? I wouldn't be trusting anything they produce with your
company's critical data. Supermicro isn't much more expensive all things
considered.

~~~
dspillett
Warning, the sound of a broken record coming up...

 _> An unreliable backup solution._

Nope. _RAID is not a backup solution_. It provides redundancy so the array can
survive an event like a device failure (and so the data survives as a
consequence) without significant downtime for repair (with zero downtime if
you have hot-swap hardware) but it does not, and is not intended to, protect
the data from the huge list of other things that can affect it.

RAID is redundancy for reliability purposes, not backup purposes.

~~~
jemka
>Nope. RAID is not a backup solution.

OP isn't saying RAID is a backup solution. RAID is being used as a component
within the backup solution (the server).

~~~
dspillett
Ah, yes. Sorry.

My knee has been playing up and jerked a little there.

RAID is certainly a valuable part of many a more substantial backup solution.

------
moe
_"The first challenge in putting together the big box was getting internal SAS
connectors properly seated into the backplane adaptor sockets"_

Excuse me?

As a customer I'd feel slightly uncomfortable about my data by now. You make
it sound like stuffing 24 disks into a box is rocket science to you, all the
while SuperMicro and others sell plug&play chassis for up to 45 disks[1].

Also, you didn't mention it in the post, but you _do_ have at least two of
these in two physically distant racks, right?

[edit: deleted snarky comment about people running windows on a fileserver]

[1] <http://www.supermicro.com/storage/>

~~~
wx77
At least for this website, it isn't your data; it's their data.

They are backing up their curated photos (which seems to pretty much be the
whole business as they say in the article).

It appears they are using Akamai services to serve their images, so hopefully
there is some extra redundancy there already.

Other than that, it seems like it would be a steal to use S3 as a backup
system at this point, because from this article it looks like they'd need to
hire another employee just to tell them why this backup solution seems a bit
silly (I mean, they had trouble setting up the hardware).

------
smerritt
We had something similar but smaller (~8 TB) at a place I worked, and it was a
nightmare. Migrating from that to S3 was one of the best things to happen to
that project.

Being a single big box, it had a bunch of single points of failure, and boy
did they fail; we probably had 5-10 hours per month of downtime due to the
photo server falling over (flaky RAID controller firmware, mostly).

Also, since the big box was expensive, we only had one in production. There
was code for taking a newly-uploaded photo and copying it over to the photo
server that only executed in production, which meant the only way to
functionally test it was to ship it and hope.

We switched to S3 about a year ago with different buckets for prod, staging,
and dev; the production-only code paths went away, and there hasn't been any
photo-related downtime since. Definitely worth it.

~~~
vilda
This sounds like you had a really bad implementation. A proper file server of
this small size would not fail for several hours per month.

~~~
smerritt
Absolutely true. The RAID controller would randomly lose drives and the driver
for it would randomly cause kernel panics. We tried different firmwares and
different kernels and made some progress, but never really got it stable under
load.

However, that's the risk you run with single points of failure. Put all your
data on one big box, and any failure in your RAID hardware, RAID firmware,
RAID drivers, network drivers, kernel, RAM, OS, et cetera will take down the
big box and thus take down anything relying on it.

The lesson I learned wasn't to make a super-robust single system, it was to
have enough redundancy to stay up when something inevitably fails.

------
rdl
For a 24-drive server, I'd just get a heavily discounted Dell or HP box. A
startup should be able to pay half list, buy two, and be ahead vs. S3.

Supermicro chassis are a big improvement over doing your own wiring. The Areca
controllers, especially in RAID 6, are great. For RAID 5 I'd also look at
3ware.

For single gigabit Ethernet, you can get away with eSATA expanders, building
something like the Backblaze pod. I've done that kind of thing for personal
use, and to have an onsite mirror of something, but I'd want several, in
several different colo facilities, to compare with S3. The exception is if you
need some kind of scratch storage, but even refilling a 40TB archive with
downloadable content takes a really long time over a 1Gbps link.
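(Back-of-envelope: 40TB over a fully saturated 1Gbps link is roughly
40,000,000MB / ~125MB/sec ≈ 320,000 seconds, or close to four days of
continuous transfer before any protocol overhead.)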

I'd build a few boxes like this now, but the Thai floods pushed hard drive
prices up to the point I have to wait. Hopefully fast 4-5TB drives will be
$200 by summer 2012.

------
jwatte
S3 gives you multi-host, multi-region redundancy. Putting it all in one box is
asking for trouble. What if the raid controller grows a bug and corrupts on
write? It's happened. What if there's a fire in the building that has both
your server and your back-up? Eggs, meet basket!

~~~
JoeAltmaier
S3 fails too, right? Didn't they have an internal network issue this year, and
go down for hours?

Anything you haven't tried, doesn't work. That's a truism in computing. I
don't think Amazon tries failing-over entire data centers very often (have
they ever?), so when it needed to happen, it didn't work.

Anyway, I'm thinking this guy has only to back up his photo store about once a
day to (something big) and put it in his bank box, and he's good to go, at
least for a photo site.

~~~
jodrellblank
_Anything you haven't tried, doesn't work. That's a truism in computing._

(Aside) I wonder if this is more accurately stated: Anything you haven't tried
recently, doesn't work. Even if you have tried it recently, that's no
guarantee.

Or, to paraphrase Hofstadter's Law, the probability of something working some
time after you last tried it drops surprisingly quickly, even when you take
into account this rule.

------
jackowayed
> _In strict accordance with KISS methodology_

Buying a ton of parts, carefully assembling them, and having it be your
problem when something breaks is simpler than paying Amazon to solve the
problem nearly perfectly?

~~~
rorrr
Better than going bankrupt.

~~~
res0nat0r
You mean, better than going bankrupt until they go bankrupt when the single
BOX goes down, right?

~~~
rorrr
That single box is made of replaceable cheap parts. Only a fire can take it
completely out.

~~~
gujk
My company's data center has suffered a fire at least once in the past 5
years.

~~~
Jugglernaut
That sounds like a terrible data center. If you count a server short circuit
with some minor smoke as a fire I guess it's understandable. The server room
should have Halon/Inergen or a similar system with smoke detectors.

Building your own solution to beat S3 is certainly viable but at this scale I
doubt it.

------
blrgeek
After reading that I'm afraid they're going to have downtime because of a
'shoddy backplane connection' sometime soon. Or that it's going to fall out of
its 'delicate balance' or have a 'driver conflict' soon :(

I wonder if they subconsciously undervalue their photos, or if this is just a
naive 'we can do better' moment.

I'm all for building your own, but there's a good reason enterprise server
hardware costs more, and there's a good reason to go for enterprise hardware
when you say 'The one most valuable asset at Oyster.com is our photo
collection'.

For instance a Dell or HP storage server with 24 disks would be around 18K
list - and be really engineered for that as opposed to hacked together.

------
fbuilesv
_For starters, 40TB on S3 costs around $60,000 annually. The components to
build the Box — about 1/10th of that_

I wonder why no one ever factors the cost of having a knowledgeable person
handling the system into their calculations. TBH 40TB doesn't sound like much,
but once you start growing you'll want someone familiar enough with the
subject to take care of it (especially if it's their most valuable asset).

~~~
aaronjg
The Amazon S3 price also takes into account redundancy across three data
centers, so for a fair comparison you should multiply the cost of the DIY box
by 3. Of course, if you don't need the redundancy, you can build it for
cheaper.

Also for some applications it really does make sense to move away from S3, and
have a solution in house. For example The Broad Institute has about 6
petabytes of storage [1]. They in particular benefit from local storage, since
all of their data is generated on-site. However, even at this scale, they
don't build the boxes themselves [2].

[1] <http://www.genome.gov/27538886>

[2] <http://www.isilon.com/press-release/isilon-iq-powers-data-storage-next-generation-dna-sequencing>

~~~
spydum
It's also important to mention that use of S3 will chew up your upstream
bandwidth while making backups, and it will affect your recovery time
objective if you ever need to run a restore (you will NOT be getting
200MByte/sec to or from S3, unless you have some sweet network connectivity).
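(Rough numbers: even at 200MByte/sec, restoring 40TB takes about 200,000
seconds, over two days; at a more realistic 100Mbit/sec of internet bandwidth
to S3 it's more like a month.)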

------
bretr
the author updated with this comment:

<http://tech.oyster.com/how-to-build-a-40tb-file-server/?#comment-211>

"Didn’t mean to give the impression that this is the _only_ backup, it is not.
It is the “warmest” one — first line. It does not share the same physical
location with the primary storage box either, but is less than 2 miles away.
So we can easily have a 2 hour recovery time without throwing away those
astronomical monthly service fees. (Although many technologists will always
prefer paying big bucks for the comfort of being cushioned from every angle by
SLAs and such — nothing wrong with that, just a different approach.)

As far as TCO goes, it cannot get any lower since we already have one or two
system guys handling _all_ servers, as well as office workstations, etc.. This
backup box takes up such a small fraction of their time that its almost
negligible — several thousand annually at most. Same goes for power, etc — it
is just one of _many_ servers.

The disks are all Enterprise Class 2TB SATA-II, several different models. We
were purchasing them right after the monsoon floods in Thailand constricted
supply so our choices were somewhat limited as time was a factor.

Raid6 has come a long way since it’s early inception days, but is still a
trade-off between raw storage capacity and processor utilization. HW RAID
industry is now old enough to not have to wait for new products to mature as
we used to when the technology itself was in its infancy. Old habits certainly
die hard, but getting the “latest and greatest” was a conscious choice made
for this specific problem, not submission to some immature fascination with
“elite” new products, or however that may be.. This card has the best specs
for Raid6 currently on the market — bottom line, period.

Big Kudos to all who made suggestions and participate in the discussion, keep
it coming!"

------
wmf
From my experience building storage, you're better off buying an enclosure
that has expanders (e.g. Supermicro); it really simplifies cabling.

~~~
rhizome
Supermicro is just such a fantastic company. Love their stuff.

------
nodesocket
Static image storage makes the most sense to do on S3 or similar. Building
your own storage does not provide the redundancy and reliability of S3.
Additionally, you have the flexibility to enable CloudFront and distribute the
images via a CDN if you need to.
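For example, as a rough sketch (this assumes s3cmd is installed and configured
with your AWS keys; the bucket name is made up):

  # sync the photo tree up to an S3 bucket
  s3cmd sync ./photos/ s3://oyster-photos/photos/
  # put a CloudFront distribution in front of the bucket
  s3cmd cfcreate s3://oyster-photos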

~~~
meroliph
Building your own storage _can_ provide the same redundancy and reliability.
You can still use a CDN as well.

~~~
res0nat0r
Sure it can, but can building and administering your own solution achieve a
TCO below S3's $0.14/GB with 99.99% availability?

~~~
meroliph
Of course it can, given how many useful open source solutions are available
nowadays.

Considering the total cost of $6,000 for a single server, you would be
dropping $12k from the get-go for two servers (though you should really lease
over a 12-month period if you can't afford it), and then you need to add colo
costs for the machines, which can be around $300-400 per machine in a
single-unit colocation environment, depending on your location. Taking the
high end, that is 2 x $6,000 + 2 x 12 x $400 ≈ $21,600 in the first year for
two machines colocated in separate datacenters, without any sysadmin costs.

The article states S3 would cost $60,000 per year, and this doesn't include
any bandwidth costs, whereas the colocation setup would include some decent
bandwidth per month. Over time, it's easy to see how a lot of money can be
saved.

Also, S3 is _designed_ for 99.99% availability, but they only guarantee 99.9%
through the SLA, which isn't extremely spectacular. Hitting 99.9% isn't
particularly hard if you have a sysadmin worth his salt to set things up
right, and with your own solution you have more direct access to the file
system as well as the ability to adhere to any regulations.

------
16s
A bit off-topic, but I wonder if anyone can comment on using software RAID
rather than hardware RAID? I'd like to try it. I've been bitten by buggy
hardware RAID controllers far too often (even high-dollar name brand gear). I
know that all of the free *nix systems offer software RAID, I'm just curious
how they perform.

~~~
mbell
I can't comment for industrial usage, but for use at home (work and non-work
use) I have the following setup for storage:

My old desktop hardware (Intel Q6600, 8GB RAM, Asus MB, pair of gigabit links
bonded)

Supermicro 4u tower case with 8x hot swap bays + the 5.25 bay filled with a 5x
hot swap cage, 13 total hot swap bays.

3-Ware 9550SX RAID Controller, 4 x WD RE 320GB drives, RAID 5.

8 x 1.5TB "Green" Drives, mixture of WD and Samsung drives, RAID 6 using mdadm
(linux software raid)

1 x WD Raptor (system drive)

Ubuntu with KVM for virtualization

Originally I was in the "must have RAID controller" camp, which is when I
bought the 3-ware controller and the RE series drives. When that array filled
up I did some research and decided to just go the mdadm route, and I have
minimal complaints so far. I still use the 3-ware array for "critical data"
and back it up to S3.
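For 16s and anyone else who wants to try it, the setup is roughly this (a
sketch from memory, using Ubuntu paths; the device names and mount point are
whatever applies on your system):

  # create an 8-drive RAID 6 array (two drives' worth of parity)
  mdadm --create /dev/md0 --level=6 --raid-devices=8 /dev/sd[b-i]
  # record the array so it assembles at boot
  mdadm --detail --scan >> /etc/mdadm/mdadm.conf
  # put a filesystem on it and mount it
  mkfs.ext4 /dev/md0
  mount /dev/md0 /mnt/storage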

Monitoring: You have to work a bit more to get proper alerting of issues from
mdadm, but it's not hard to set up. It doesn't matter much for me; I sit in
the same room as this server most of the day working, so if something goes
wrong I generally notice before I get the e-mail.
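The alerting itself is basically this (a sketch; assumes a Debian/Ubuntu
layout and a working local mail setup, with your own address substituted in):

  # have mdadm e-mail you when an array degrades or a drive fails
  echo "MAILADDR you@example.com" >> /etc/mdadm/mdadm.conf
  # or run the monitor explicitly as a daemon
  mdadm --monitor --scan --daemonise --mail=you@example.com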

Performance: As mentioned, I'm using Green drives; this system wasn't built
with speed in mind but rather for large amounts of nearline storage.
Nevertheless, with some basic tweaking and making sure the array's partition
alignment is correct, I get around 450MB/sec read speed and ~85MB/sec write
long term. I have the system set up to cache writes aggressively, since most
of the data on this array isn't critical and it's on a UPS; what this means is
that writes under a few gigabytes usually complete at wire speed (~200MB/sec)
and then get flushed to the disk later. Most of the time I'm limited by
network bandwidth to this system, unless I'm writing a very large amount of
data all at once.
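(For what it's worth, aggressive write caching like this is usually just the
kernel's dirty-page knobs turned up; something along these lines, with values
to taste:)

  # let dirty pages accumulate longer before the kernel forces writeback
  sysctl -w vm.dirty_background_ratio=10
  sysctl -w vm.dirty_ratio=40
  sysctl -w vm.dirty_expire_centisecs=6000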

One negative is rebuild speed; here I'm very limited by the 'Green' drives, I
believe. It runs at about 50MB/sec, so rebuilds do take a while.

As far as CPU usage goes, I've never seen it be the limiter, but I haven't
watched it that closely; it doesn't peg during rebuild. This machine acts as
an SMB/NFS file server and runs a few development VMs 24/7 (a database, and a
couple of other things), and I've never really had an issue with CPU usage.

One really nice bonus is that if something in the system fails, you can just
plug the drives of your array into almost any other Linux system:

apt-get install mdadm

mdadm --assemble --scan

Poof, working array.

tl;dr

If you're going after 1GB/sec transfer speeds, get a high-end RAID card.

If you just need some large redundant storage that can saturate a Gbit link or
two, then mdadm software RAID is just fine IMO.

------
cagenut
Yes, S3 can get expensive, but IMHO this swings the pendulum too far in the
other direction. Something like a Riak cluster of four 2U/24-drive servers
would get you the cost structure of good colo but the
features/resilience/operational flexibility of something more like S3.

~~~
lucaspiller
Good suggestion. I posted on the Riak mailing list regarding this:

<http://lists.basho.com/pipermail/riak-users_lists.basho.com/2011-December/006955.html>

------
dfrankow
How much more expensive is a clustered software solution (e.g., Hadoop FS)
than this RAID box?

~~~
Jugglernaut
That depends on how much hardware you want to spread it out on. Many companies
would have to hire a Hadoop guy; then again, some companies would have to hire
a sysadmin to run the Oyster solution. Build your storage to fit your company.

------
papercruncher
How are you dealing with bit rot? Are you periodically scrubbing the data to
give the controller a chance to repair, or are you waiting to get a URE during
an array rebuild? Are you running end-to-end checksums against all your data
to protect against bad firmware, bad RAM, etc.? What is your mean time to
repair in case you lose a drive?

One more question: you saturated the network link with a sequential
read/write, but is that how you actually store the data? If not, how long
would it take you to be up and running on another CDN in case Akamai goes down
in flames?

------
zorg
"Having spent some extra time on research, fine-tuning, and optimizing the new
server, we were glad to find that the gigabit network had became the
bottleneck"

This matches my experience: RAID performance in linear IO is often an order of
magnitude below what the disks should allow. This guy is relieved to finally
get 1 gigabit of useful bandwidth out of 24 disks (each of which can manage
roughly a gigabit on its own). So it's no faster than a single disk.
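Rough math: 24 spindles that each do on the order of 100MB/sec sequential
could in theory stream a couple of GB/sec in aggregate, yet a saturated
gigabit link is only ~120MB/sec, about what one modern 2TB drive manages by
itself.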

(I know it's linear IO in this case because the screenshot shows a large file
copy)

------
tomkarlo
This seems like a half-solution - I can understand building a big box locally
for day-to-day access to the images (if they indeed need that) but I didn't
see any mention of an off-site backup. Even assuming no problems like file
corruption, what happens if there's a fire or flood? At the least, there
should be two copies of this big box, in different places.

------
forensic
Off-site backup conveniently left out of the write-up?

------
latchkey
Hey everyone, don't worry, be happy! Their VP of engineering used to work at
'a startup' and prior to that, he was a 'rocket scientist' because he worked
at Raytheon on missile guidance systems. Oh, and before that, he worked at the
Mothership, I mean Microsoft, in the 'user experience team'. As an added
bonus, if you want to work at Oyster, you get your choice of such cutting edge
technologies as 'Python, PostgreSQL, Nginx, Windows, CentOS, C++ and more'!

Nothing to worry about here, I think we are in good hands.

