
180TB of Good Vibrations – Storage Pod 3.0 - BryantD
http://blog.backblaze.com/2013/02/20/180tb-of-good-vibrations-storage-pod-3-0/
======
bcantrill
I think it's worth (re)reading this comment on the original Storage Pod from
several years ago:

[http://www.c0t0d0s0.org/archives/5899-Some-perspective-to-
th...](http://www.c0t0d0s0.org/archives/5899-Some-perspective-to-this-DIY-
storage-server-mentioned-at-Storagemojo.html)

I essentially have all of the same questions for Storage Pod 3.0 -- and
in particular, what does the software stack look like? (This config is
absolutely begging for ZFS, but I have a haunting feeling that something janky
is afoot upstack.) I would also be curious as to the specific nature of
failures that have been seen with the deployed architecture. Have the concerns
from three years ago proven to be alarmist or prescient?

That said: I think it's very valuable to get configs like this out there for
public discussion -- and I think it might be inspiring us (Joyent) to
similarly publicly discuss our own high density storage config...

~~~
brianwski
Disclaimer: I work at Backblaze. I'm not technically on the server team, but
here is what I understand: we have 450-ish Backblaze pods (each with 45 hard
drives) deployed in the datacenter. We are JUST NOW starting to see some old
age mortality (increased failures) of the drives we deployed about 4 and a
half years ago. We're really happy with the longevity, it exceeded everything
we were told to expect.

We group the drives into 15 drive RAID6 groups, where there are 13 data drives
and 2 parity drives. This means we can lose 2 drives and not lose any data in
that particular RAID6 group. We use the built in Linux "mdadm" tools to do
this.
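For concreteness, the layout described works out as follows (drive and parity counts from the comment; the 4 TB drive size matches the article's 180 TB figure and is otherwise an assumption):

```python
# Pod layout from the comment: 45 drives, split into 15-drive RAID6
# groups of 13 data + 2 parity (built with Linux mdadm). Drive size
# assumed to be 4 TB to match the article's 180 TB raw figure.
DRIVES_PER_POD = 45
GROUP_SIZE = 15
PARITY_PER_GROUP = 2
DRIVE_TB = 4.0

groups = DRIVES_PER_POD // GROUP_SIZE                   # 3 RAID6 groups
data_drives = groups * (GROUP_SIZE - PARITY_PER_GROUP)  # 39 data drives
usable_tb = data_drives * DRIVE_TB

print(groups, data_drives, usable_tb)  # 3 39 156.0
```

So each pod yields roughly 156 TB usable out of 180 TB raw, and each 15-drive group tolerates any two simultaneous drive failures; a third failure in the same group before a rebuild completes would lose that group's data.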

The network interface to a pod is through HTTPS talking with Tomcat (Java web
server). Java writes the data to disk (ext4 on top of the above RAID6). Our
application (backup) is very specific and performance forgiving, essentially
we write data once and then re-read it once every few weeks and recalculate
the SHA-1 checksums on the files to make sure the data is all completely,
totally intact and a bit hasn't been thrown somewhere.
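A minimal sketch of that scrub loop, assuming a simple JSON manifest of path -> SHA-1 recorded at write time (the manifest format and function names are illustrative, not Backblaze's actual code):

```python
import hashlib
import json

# Illustrative scrub: re-read every file, recompute its SHA-1, and flag
# any file whose on-disk bits no longer match the checksum recorded at
# write time. The path -> sha1 manifest layout is an assumption.

def sha1_of(path, bufsize=1 << 20):
    h = hashlib.sha1()
    with open(path, "rb") as f:
        while chunk := f.read(bufsize):
            h.update(chunk)
    return h.hexdigest()

def scrub(manifest_path):
    with open(manifest_path) as f:
        manifest = json.load(f)
    # Any mismatch here means silent corruption somewhere end-to-end.
    return [p for p, digest in manifest.items() if sha1_of(p) != digest]
```

Because the checksum was computed on the client before upload, a mismatch catches corruption anywhere along the path: RAM, network, controller, or disk.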

One of the "luxurious" parts of working at Backblaze is we own BOTH the client
and the server. On a customer's laptop, the client pre-digests the data,
breaks it up into chunks that make sense (more than 5 MBytes and less than 30
MBytes) and then the client compresses it if appropriate (we don't compress
movies or audio because it would be silly wasted effort) and the client
encrypts the data, then sends it through HTTPS to our datacenter. Because the
client computer is supplied by customers, all their CPU cycles are "free" to
us. We can conveniently break up files, encrypt them, deduplicate (within that
client) all without spending any CPU cycles at Backblaze because it is done on
the customer's laptop before being sent.
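A rough sketch of that client-side pipeline. The 30 MB upper chunk bound comes from the comment; the extension list, gzip, and SHA-1-as-dedupe-key details are illustrative assumptions (the real client's formats are not public):

```python
import gzip
import hashlib

MAX_CHUNK = 30 * 1024 * 1024   # upper chunk bound from the comment
ALREADY_COMPRESSED = {".jpg", ".mp3", ".mp4", ".mov", ".zip"}  # assumed list

def chunk_file(path):
    """Split a file into <= 30 MB pieces before preparing each one."""
    with open(path, "rb") as f:
        while piece := f.read(MAX_CHUNK):
            yield piece

def prepare_chunk(data: bytes, ext: str):
    """Compress (unless it's media), then hash; encryption would follow."""
    if ext.lower() not in ALREADY_COMPRESSED:
        data = gzip.compress(data)           # skip for media: wasted effort
    digest = hashlib.sha1(data).hexdigest()  # dedupe key within this client
    # encrypt-then-upload over HTTPS would happen here
    return digest, data
```

The point of the design is that every CPU-heavy step (split, compress, hash, encrypt) runs on the customer's machine, so the datacenter side only has to write bytes.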

Again, the Backblaze storage pods really aren't the correct solution for all
"off the shelf" type IT projects. For example, it won't meet the performance
needs of many applications. But it does work exceptionally reliably in our
experience as a backup solution when you have one or two programmers to help
implement a custom software layer in Java.

~~~
vincentkriek
Wow, thanks for the explanation! I would love to learn more about the software
you guys use!

One specific question, how do you know if the checksum is correct? Do you keep
a database of checksums stored on a specific pod? And if the checksum is not
correct, do you have other copies on other pods?

------
vardump
I think it should use ECC memory and an i3 CPU that supports it. Random memory
bit flips are going to corrupt data at a steady pace.

Intel i3 processors that support ECC:
[http://ark.intel.com/search/advanced/?s=t&FamilyText=2nd...](http://ark.intel.com/search/advanced/?s=t&FamilyText=2nd%20Generation%20Intel%C2%AE%20Core%E2%84%A2%20i3%20Processors&ECCMemory=true)

Also, it'd be interesting to hear why Backblaze doesn't use SuperMicro SAS-
boards instead with a SAS-expander, like HP SAS Expander.

~~~
brianwski
Oh, about the "random memory flips" -> in our particular application, the
client running on a customer's laptop encrypts the data then calculates a
SHA-1 checksum THEN transmits the file through HTTPS to the pods. The pods
write it to disk with the checksum there. Once every couple of weeks we re-
read the file and re-calculate the SHA-1 checksums. If there was ever a
problem, we would detect it. These turn out to be VERY rare, but they do
happen where a file is fine for many years then a bit is flipped "on disk" (we
don't think they are in the RAM, but it doesn't matter, it is an "end-to-end"
check). We assume this is happening in consumer systems also, but at the rates
we see it would be undetectable in a consumer's world (1 bit per customer
lifetime - it would probably create a tiny mis-spelling in a MS Word document,
or maybe one pixel would be wrong in one JPEG).

~~~
krenel
Or 1 bit flip could corrupt an entire 128-bit block of AES-encrypted data.
OpenSSL would complain when trying to decipher the file, giving a "bad magic
number" error.

BTW, keep up the great work guys!

------
kyrra
The serviceability of those things doesn't look pleasant. I used to work in the
storage industry and got to play with (what is now) NetApp high-density setups
[1]: 60 drives in a 4U setup, compared to 45 drives in a Storage Pod in the
same 4U. But I'm guessing the cost is where the Storage Pod really wins out. NetApp
gear, even as a JBOD, is really expensive.

The NetApp box has the same type of padding for all the drives, but they are
much easier to access (the pull-out trays are stable and easy to use).

A fun issue I saw with the NetApp box (at least 3 years ago): fully loaded with
drives, it went over the weight limit that FedEx or UPS would accept for
standard shipping. It required freight shipping to ship a single, fully loaded
E5400.

[1] <http://www.netapp.com/us/products/storage-systems/e5400/>

~~~
vnchr
Storage hardware startup here. We can do 72 drives in 4U for standard racks
and 120 drives in 4U for deep racks, but it is difficult to service, so we're
only pushing it to early adopters. However, high-density serviceability can be
addressed. Our next system should fit the same drive counts in 5U, making it
much easier to swap drives. Not much can be done about the shipping problem, ha ha.

~~~
atYevP
And what hardware startup might this be? I'm curious ;-)

~~~
vnchr
Sup Yev! Evtron. We're in St. Louis. We have a call with Gleb on Wednesday.
Love to chat some time: ivicars@evtron.com / @israelvicars.

~~~
atYevP
Hah, well I'll leave you guys to it ;-)

------
recuter
"Our monthly cost for a full rack of Storage Pods with 3 TB drives is $0.63
per TB, while a full rack of Storage Pods with 4 TB drives is $0.47 per TB. "

Interesting coincidence, 180TB of Google Storage comes out at $8560 per month
which is $47.55 per TB. Almost exactly x100.

<https://developers.google.com/storage/docs/pricingandterms>

And that does not include the network cost.
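The arithmetic behind that comparison, as a quick sanity check (prices as quoted above):

```python
# Prices as quoted: $0.47/TB/month for a rack of 4 TB pods, $8560/month
# for 180 TB of Google Storage at the time of the comment.
pod_per_tb = 0.47
gcs_monthly = 8560.0
gcs_per_tb = gcs_monthly / 180           # ~$47.56/TB/month

print(round(gcs_per_tb, 2))              # 47.56
print(round(gcs_per_tb / pod_per_tb))    # 101, i.e. roughly 100x
```

Note this compares Backblaze's raw hardware cost against Google's retail price, so power, staff, bandwidth, and margin all live inside that ~100x gap.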

~~~
budmang
Well, either they're making a 99% gross margin, or perhaps they need a few
Backblaze Storage Pods.

~~~
EwanToo
The very high gross margin is more likely - they're charging as much as they
can, not as little as they can.

------
tobiasu
Unrelated, from their FAQ:

"Look, I'm an Advanced User, and I Already Have a Set of RAID Drives with Perl
Scripts to Copy My Files Back and Forth Between My 18 Home Machines that are
in a Datacenter I've Built in My Closet. Why Do I Need Backblaze?"

Made me smile, probably because I have more machines than m^2 in my
apartment...

~~~
joshAg
And yet they don't have a *nix client...

------
ChuckMcM
I love a good storage story. It's interesting that they still put them behind a
couple of gigabit network ports. Using the native network (2 x 1GbE), it would
take more than a week at full throughput to get a full load off of or onto a
pod.
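Back-of-envelope check on the "more than a week" figure, assuming the full 180 TB pod and an ideal 2 Gbit/s aggregate over the two 1 GbE ports:

```python
POD_BYTES = 180e12          # 180 TB, decimal
LINK_BPS = 2 * 1e9 / 8      # 2 x 1 GbE, ideal aggregate, in bytes/s

days = POD_BYTES / LINK_BPS / 86400
print(round(days, 1))       # 8.3 days at line rate; longer in practice
```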

I had an ultimately unrewarding conversation with Sean Quinlan (of Google GFS
fame) about the futility of putting a lot of storage behind such a small
channel (in Google's case the numbers were epically Google, of course, but the
argument was the same). You waste all of the spindles because the operation
rate (requests coming into the channel) vs. the amount of data ops needed to
satisfy each request basically leaves your disks waiting around for the next
request to come in from the network. (BTW, that allows you to make a nearly
perfect emission-rate scheduler for disk arms, but that is another story.)

What this means is that petabyte pods are going to be nearly useless, although
with an external index they can be dense.

~~~
mikeash
I could see it being a problem for Google, but Backblaze wants these for
archival purposes, not something where there is going to be a lot of reading
and writing. The write rate is going to be whatever speed their users upload
stuff, divided by the total number of their storage pods. I assume this is
relatively small. The read rate is going to be whatever speed their users
download restores, divided by the total number of storage pods, which is
probably much smaller still.

The assumption here is that data is kept for a long time relative to how
frequently it's written and read, so the IO speed probably isn't that big of a
deal.

~~~
donavanm
No. As you said, port speed doesn't matter for data at rest. What matters is
ingest/exfil of data due to "exceptional" conditions. Prime cases are
cluster/mirror failure. Remirroring existing data to another pod is
port-limited, as is ingest for pods that are remirror targets.

~~~
mikeash
Is there any reason the sources and targets couldn't both be thoroughly
distributed throughout the cluster? Nothing says hard drives have to be
perfectly replicated, you just need multiple copies of the _data_. I'm
imagining that an HD dies, and the extra copies of what it contained are
scattered all over. You re-replicate them by scattering them further all over.
No one pod has to move any substantial amount of data.

~~~
donavanm
Sure. You can absolutely replicate chunks. But you start kicking the problem
upstream. A rack down is a couple PB, so you start doing a ton of cross-rack
transfers to get your replica counts back up. Now you're gated on NIC/ToR/agg
switch throughput. A DC down and you're gated on NICs, ToRs, aggs, and the
intra-DC network. And this keeps adding up in $$$ the further out you get.
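To put rough numbers on that, taking the "couple PB" rack figure above and some common uplink speeds (the link sizes are illustrative assumptions, not anything stated in the thread):

```python
RACK_BYTES = 2e15                         # "a couple PB" per rack

for gbit in (10, 40, 100):                # assumed aggregate uplink sizes
    days = RACK_BYTES / (gbit * 1e9 / 8) / 86400
    print(gbit, round(days, 1))           # 10 -> 18.5, 40 -> 4.6, 100 -> 1.9
```

Even at an ideal 100 Gbit/s aggregate, re-replicating a failed rack takes the better part of two days, which is why the network, not the disks, ends up setting the budget.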

MS had an interesting paper on data locality in storage last year. Can't
recall the title offhand though.

------
johnmwilliams
I am curious where they purchase 4TB disks for $195. That is less than half of
what most places charge. That's quite a price break, even when buying in bulk.

~~~
danso
A Seagate 4TB external drive is selling for $180 right now:
[http://www.amazon.com/Seagate-Backup-Desktop-External-
STCA40...](http://www.amazon.com/Seagate-Backup-Desktop-External-
STCA4000100/dp/B00829THLE/ref=sr_1_2?ie=UTF8&qid=1361379018&sr=8-2&keywords=4tb+drive)

Yes, probably not the same speed specs, but it's external, which means it
usually costs more for the convenience of the housing.

Edit: Unless my math is wrong, the 3TB model is a better price point...$40 for
a TB as opposed to $44+/TB for the 4TB version.
[http://www.amazon.com/Seagate-Backup-Desktop-External-
STCA30...](http://www.amazon.com/Seagate-Backup-Desktop-External-
STCA3000101/dp/B00829THQE/ref=sr_1_8?ie=UTF8&qid=1361379272&sr=8-8&keywords=4tb+drive)

~~~
atYevP
Yev from Backblaze here -> We're not above liberating drives from their
enclosures:
[http://blog.backblaze.com/2012/10/09/backblaze_drive_farming...](http://blog.backblaze.com/2012/10/09/backblaze_drive_farming/)

~~~
astrodust
Husking external drives seems like a great way to get a deal.

~~~
keidian
I've done it myself for my home servers. It tends to get strange looks; I can
only imagine the reactions Backblaze got buying at that scale :p

------
shocks
It's crazy that hard drive prices are still higher than they were two years
ago because of the flooding...

~~~
atYevP
It's not _THAT_ crazy: the drive manufacturers saw demand stay constant while
supply dwindled, and their response was to raise prices across the board. They
really have no need to drop them again as quickly as they had been falling
before the flooding. As soon as the first major producer drops prices back to
normal levels the others will follow suit, but no one wants to be the first,
since people are still buying them at inflated prices.

~~~
shocks
I understand the law of supply and demand, I am merely expressing surprise
that the effect has lasted for over two years. :)

------
Serow225
Disclaimer: I work at a company that makes hard drives. Don't forget about
their sensitivity to rotational vibration, most often induced by coupling from
other drives. Maybe it's not important in this particular application, but it
can be a performance killer. To avoid it, 'vibration absorption' is not always
the name of the game :) This link has some good background info, and a
reminder that there are actual hardware/firmware reasons why enterprise &
near-line drives are more expensive than consumer drives:
[http://enterprise.media.seagate.com/2010/05/inside-it-
storag...](http://enterprise.media.seagate.com/2010/05/inside-it-storage/are-
all-hard-drives-created-equal-examining-rotational-vibration-in-desktop-vs-
enterprise/)

~~~
mctx
It sounds like they've got enough redundancy that a single drive failure (or
even an entire pod) isn't an issue - they use cheap drives and simply swap
them out when they fail.

I wish Google's hard drive failure analysis [1] included failure rates and
failure scenario statistics for different models, vendors and
consumer/enterprise classes too.

[1] <http://research.google.com/archive/disk_failures.pdf>

~~~
Serow225
Knowing Google, I'm pretty sure that their drive population was consumer-grade
- I think they mentioned that in the paper. It was interesting that the paper
lamented the limited usefulness of SMART in predicting failures, because one of
the things that enterprise/nearline drives buy you is much, much richer SMART
diagnostics. I wonder what their results would have been with drives that had
better diagnostic reporting. I'm also curious if folks that are building
'cold-storage' on the cheap have looked into using DVR drives, they might have
some useful characteristics for this application.

------
craigyk
I bought a used HP MDS 600 on eBay for less than $2,000 shipped. It doesn't
have a computer built in, but unlike the Backblaze pod it has four power
supplies (two redundant), four fans, and two SAS interfaces. It also holds 70
drives. I've thought of building a Backblaze pod before, but if you want to
meet your storage needs with a single enclosure, there are better solutions
for the money. The Storage Pod is really for when you have enough of them to
consider any single one redundant.

~~~
jacquesm
Obviously a one-off is always going to be sourceable from second-hand bits
and pieces. But if you need to build these on a regular basis and maintain
them, then uniformity and a slightly higher cost are not an issue.

Nice deal though!

------
venomsnake
Isn't the "use same drives" advice a little dangerous? If you have bad luck
with firmware or a specific batch of drives, you could be put in a world of
pain fast.

~~~
atYevP
Yev w/ Backblaze here -> This is true, and we've had bad batches before, but
all of the drives and chassis are tested before we put them into production;
having arrays with the same drives minimizes the variables throughout the pod.

------
gadders
Based on not much other than their blog postings, I really like the (for want
of a better word) _vibe_ of this company.

------
digitalzombie
I wish they supported Linux.

~~~
alexkus
Eh? What makes you think it doesn't already?

~~~
astrodust
The pod itself does, but the BackBlaze client software for the BackBlaze
service does not.

~~~
alexkus
Sorry, yes, I was thinking about support in relation to the pod itself (the
subject of this article) and not BackBlaze in general.

------
Keyframe
I'm interested in maybe having a couple of those boxes for video data, to keep
it online for editing bays. What would be the best solution to back up that
amount of data - redundant boxes?

~~~
beernutz
Yes, redundant boxes would probably be the best bet for data of that size.
Remember though, RAID is NOT a backup. When possible, make sure to send copies
to multiple places!

------
rdl
I wonder when the Hitachi 7K4000 4TB 7200rpm drives will hit $150 in quantity
100 or less. I think they were $180 on Black Friday 2012, and $210 recently.

------
andrewcooke
whoa. 2.5" (boot) drives more reliable than 3.5". why?!

~~~
moe
I wonder why they use a boot-drive at all instead of netbooting these things
(or use a pair of USB-sticks).

~~~
brianwski
Disclaimer: I work at Backblaze. We use the boot drive just because it's a
tiny bit "cleaner" to boot off a separate drive (and configure swap on that
drive) and have all the "data" alone on the 45 drive array. It allows you to
boot and THEN reassemble a failed raid array that is on other drives, stuff
like that. You could do it off of USB sticks, but I'm not sure about how well
that would work with swapping? Maybe it would be fine? In the end it hasn't
been worth focusing on yet, the price difference would be really super small.
We spend most of our brain cells trying to find cheaper hard drives, which
account for 80 percent of our costs.

Netbooting -> we're actively looking into that as an option. But you still
probably need a local swap drive.

~~~
moe
You don't normally need a swap-drive on a storage node (if it needs to swap
then something is wrong to begin with), but of course I don't know the details
of what your nodes may do beyond dealing out files.

The main advantage of PXE-booting would be maintainability (rolling out
upgrades by a simple reboot, etc.) but I assume at your scale you have that
already figured out in one way or another.

Either way, thank you for all the insights that you keep sharing with the
public! These hands-on blog posts are priceless both in entertainment and
education value. :-)

------
purephase
That's a pretty impressive price point.

