
Storage Pod 6.0: Building a 60 Drive 480TB Storage Server - geerlingguy
https://www.backblaze.com/blog/open-source-data-storage-server/
======
nocarrier
It's always great to see storage hardware being opened up, thanks Backblaze.

I have a question after reading the Vault overview here:

https://www.backblaze.com/blog/vault-cloud-storage-architecture/

I'm curious what the bandwidth demand is when an entire host fails and you
have to drop a new replacement host in, since you would have 60 Tomes in the
Vault that need to be rebuilt at once. You have at least two other parity
copies when a host dies (assuming no other drives in the affected host's Tomes
are dead on other hosts in the Vault set), so I'm guessing you can afford to
wait the handful of days it would take to rebuild the host. I'm still curious
to know what rebuild speed you plan for, since I'm guessing you'd be looking
at 40Gbps NICs eventually. I was surprised to see only 2x10Gbps.

~~~
brianwski
Brian from Backblaze here. If we do a full pod swap for a blank, we are disk /
CPU limited in the rebuild, not 10 Gbit/sec limited. We have the rebuilds up to
taking about 3/4ths of the 10 Gbit/sec link, and we continue to tune the
software and buffering to push this higher.

It's MUCH better to lose only one drive, because we distribute the rebuild CPU
task across all 20 CPUs in the tome and we can get it synced back up in a few
hours. All the chattering back and forth still doesn't come anywhere close to
maxing out the 10 Gbit.

For uploads, each individual pod has a 10 Gbit/sec connection, so ACTUALLY the
vault has an aggregate of 200 Gbit/sec facing the internet. However, every byte
a pod accepts it ALSO has to retransmit internally, which cuts the actual
theoretical upload rate to 100 Gbit/sec for a properly parallelized
application.
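
For anyone sanity-checking those figures, here is a back-of-envelope sketch
using only numbers from this thread (20 pods per vault, 10 Gbit/sec per pod,
60 x 8 TB drives per pod, and a rebuild running at roughly 3/4 of one pod's
link):

    // Back-of-envelope check of the vault numbers discussed above.
    public class VaultMath {
        public static void main(String[] args) {
            double podLinkGbps = 10.0;
            int podsPerVault = 20;

            // Every byte a pod accepts is also retransmitted internally,
            // halving the theoretical aggregate upload rate.
            double aggregateGbps = podLinkGbps * podsPerVault;  // 200
            double effectiveGbps = aggregateGbps / 2.0;         // 100

            // Full-pod rebuild: 60 x 8 TB at 3/4 of a 10 Gbit/sec link.
            double podBytes = 60 * 8e12;                        // 480 TB
            double rebuildGbps = 0.75 * podLinkGbps;            // 7.5
            double days = podBytes * 8 / (rebuildGbps * 1e9) / 86400;
            System.out.printf("ingest %.0f -> %.0f Gbit/s, rebuild ~%.1f days%n",
                    aggregateGbps, effectiveGbps, days);        // ~5.9 days
        }
    }

The ~5.9 day result matches the "handful of days" estimate in the parent
comment.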

~~~
scurvy
> If we do a full pod swap for a blank, we are disk / CPU limited in the
> rebuild, not 10 Gbit/sec limited

Wow. This is pretty shocking TBH. My basic Ceph nodes will crush a 40 Gbps
link doing a rebuild/rebalance. What's your bottleneck?

~~~
mappu
Backblaze use an in-house Java pseudo-RAID layer based on Reed-Solomon codes,
with 17 data shards and 3 parity shards. There are some performance numbers
here: https://www.backblaze.com/blog/reed-solomon/ (2015).
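
To make the 17+3 arithmetic concrete, here is a minimal sketch of the shard
layout (not Backblaze's actual code; the real parity bytes come from a
Reed-Solomon encoder over GF(256), which this sketch leaves as a stub):

    // A file is split into 17 data shards plus 3 parity shards, one
    // shard per pod in a 20-pod vault. Any 17 of the 20 shards can
    // reconstruct the file, so a tome survives 3 simultaneous failures.
    public class ShardLayout {
        static final int DATA_SHARDS = 17;
        static final int PARITY_SHARDS = 3;
        static final int TOTAL_SHARDS = DATA_SHARDS + PARITY_SHARDS;

        public static void main(String[] args) {
            byte[] file = new byte[1_000_000];  // pretend file contents

            // Round the shard size up so 17 shards cover the file.
            int shardSize = (file.length + DATA_SHARDS - 1) / DATA_SHARDS;
            byte[][] shards = new byte[TOTAL_SHARDS][shardSize];
            for (int i = 0; i < DATA_SHARDS; i++) {
                int from = i * shardSize;
                int len = Math.min(shardSize, file.length - from);
                if (len > 0) System.arraycopy(file, from, shards[i], 0, len);
            }
            // shards[17..19] would be filled in by the RS encoder here.
            System.out.printf("shard size %d bytes, storage overhead %.1f%%%n",
                    shardSize, 100.0 * PARITY_SHARDS / DATA_SHARDS);  // ~17.6%
        }
    }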

~~~
nwmcsween
I wouldn't mind understanding why. I'm assuming that at the time Ceph wasn't
mature enough (and still somewhat isn't, due to btrfs; yes, I know it's
optional).

~~~
scurvy
btrfs isn't a requirement, nor really a recommendation, for production Ceph
clusters. It was really only there for weirdo benchers.

BlueStore and RocksDB really change the back-end story altogether. No more
POSIX filesystem junk with journals and such.

------
daveguy
Hey Yev or Brian,

First, I think it's awesome that you guys come on HN to respond when Backblaze
makes the news. Thank you.

If you bought a prebuilt storage pod (from somewhere like
http://www.backuppods.com/), could you start out at less than capacity and add
more drives as you go? Say you started with one drive: would it need to go in
a specific location? When you add more, would there be a significant
reconfiguration process? Is it possible to hot-expand? I guess location and
config have a lot to do with the software involved. Has there been a blog post
on the software side of managing that much storage? Can it be as simple as a
spanning partition that treats all drives as a single drive?

Also, are you tracking the cost of SSDs? Do you expect cost parity soon? It
seems that capacity parity is getting close, but cost parity (especially at
that capacity) is farther off. Do you have an estimate of when it might happen,
or is it "far off, check again in 5 years"? I'm sure you all found the SSD
endurance test at TechReport interesting
(http://techreport.com/review/27909/the-ssd-endurance-experiment-theyre-all-dead).

~~~
brianwski
Yes, and we do this internally in our QA and load testing lab. It's an
expensive waste to put $200,000 worth of drives in a 20 pod vault when all you
need is 1/60th of that!

> are you tracking the cost of SSD drives? Do you expect cost parity soon?

Yes we track SSDs, and this is a hotly contested topic inside Backblaze.
Personally I'm predicting they "cross over" in cost effectiveness within 2
years. But I don't have any more insight than you do, and many inside
Backblaze disagree with me.

When I say "cost effective" that is total cost of ownership, which includes
paying for less electricity to power the SSD, possibly the higher density of
SSDs, and whatever failure rates of the SSDs forcing us to purchase more
(might be better or worse than hard drives). So whatever our little spread
sheet kicks out as cheaper -> that's the one we will purchase!
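
As a toy illustration of that kind of spreadsheet, the comparison might look
something like the sketch below. All of the numbers in it are made-up
placeholders, not Backblaze's figures:

    // Hypothetical 5-year TCO-per-TB comparison: purchase price plus
    // electricity plus expected replacement cost, divided by capacity.
    public class DriveTco {
        static double tcoPerTB(double usdPerTB, double watts, double annualFailRate,
                               double years, double usdPerKwh, double capacityTB) {
            double purchase = usdPerTB * capacityTB;
            double power = watts / 1000.0 * 24 * 365 * years * usdPerKwh;
            double replacements = purchase * annualFailRate * years;
            return (purchase + power + replacements) / capacityTB;
        }

        public static void main(String[] args) {
            // placeholders: 8 TB HDD at $30/TB drawing 6 W vs.
            // 2 TB SSD at $80/TB drawing 2 W, at $0.10/kWh
            System.out.printf("HDD $%.2f/TB vs SSD $%.2f/TB (5-year TCO)%n",
                    tcoPerTB(30, 6.0, 0.05, 5, 0.10, 8),
                    tcoPerTB(80, 2.0, 0.02, 5, 0.10, 2));
        }
    }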

One random nice thing about SSDs is there will be less case vibration, and the
drives shouldn't be affected AT ALL by the small amount of fan vibration.

~~~
DanielDent
Am I correct in assuming that the performance improvements of SSDs are
worthless for your use case?

It seems like the workload of Backblaze (like many "big data" workloads) is
such that the extra IOPS don't actually help.

With thousands of drive spindles, and a very small portion of the data on the
spindles being needed at any given moment, I don't imagine you are actually
limited by the performance of spinning rust.

~~~
brianwski
> the performance improvements of SSDs are worthless

No way! We are disk I/O limited right now. We think we can approximately
double our performance simply by using SSDs instead of traditional 7200 RPM
drives. I'm advocating for an "SSD Pod" built with 1 or 2 TByte SSDs.

> and a very small portion of the data on on the spindles being needed at any
> given moment

With our original Personal Backup product that was true. Everybody was happy
waiting a few hours for their restores to complete. But with our new product
line of B2 (competes with Amazon S3), suddenly performance can matter much
more, because programmers using B2 might be implementing all sorts of
different access patterns. If those programmers are using B2 to implement
online backup, then yes, still fine. But if the programmers are building
Hacker News or Reddit or Dropbox, the access patterns might require faster
IOPS to be more responsive.

~~~
DanielDent
I guess the B2 product is growing quickly or is where you are looking to drive
growth? Or are you using different storage pools for different products?

It's surprising to hear this. I would have thought that your backup workloads
would leave enough unused IOPS that you wouldn't be anywhere close to hitting
a performance ceiling.

What metrics do you optimize towards? Tail latency?

------
boulos
A hard drive weighs about 1.5 pounds, so 60 of them in a pod is about 100
pounds before you even include the pod itself. That doesn't sound particularly
pleasant for the technician to install the first time...

You put 10 of those in a rack and now we're up to 1000 pounds. At what point
does this become challenging for the _floor_?

~~~
protomyth
> That doesn't sound particularly pleasant for the technician to install the
> first time...

I've pretty much reconciled myself to the fact that I'm going to have to buy
something from http://serverlift.com at some point if I keep doing this job
and stuff keeps getting heavier.

~~~
mark-r
Just fill the pod with those helium drives, it will float right into place!

~~~
atYevP
Tried a few once...had to tether them to the ground :P

------
lunixbochs
One note - looks like you can get 8TB WD Helium drives for way cheaper ($250
vs $340) right now by shucking externals:
http://www.amazon.com/Book-Desktop-External-Drive-WDBFJK0080HBK-NESN/dp/B01B6BN0Q2

~~~
Scramblejams
Thanks for the tip. Though Backblaze has spoiled us with so much good
reliability data, I now feel completely paralyzed in the face of buying a new
drive if Backblaze hasn't told me how much I can trust it!

~~~
atYevP
Yev from Backblaze here -> You can trust it. As long as you're OK with it
dying and have other backups in place ;-)

~~~
Scramblejams
Ha! Well, even if I have backups, I don't like buying replacements, so there's
that...

------
eloff
Now at $0.036/GB. So the hardware basically costs what storing the same data
in the cloud for a single month would, and cloud storage is one of the more
price-competitive options in the cloud.

~~~
atYevP
Yev from Backblaze here -> We actually offer Backblaze B2 at $0.005/GB/month,
so slightly less expensive! ->
https://www.backblaze.com/b2/cloud-storage-pricing.html

~~~
stephenr
Are you still in a single DC? I'd love to suggest you to clients.

~~~
atYevP
Currently yup! But we're actively looking to expand ;-)

------
asdz
At the rate of $0.036/GB, I'm only paying Backblaze USD 50/year. Meaning if I
store more than 115.74 GB, Backblaze is losing money? And that is the cost of
storage alone; counting other costs, the break-even point could be 80 GB. How
do you guys make sure you can sustain this in the long run? (I don't want to
face another 'Copy' case T.T)
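
For reference, the arithmetic behind that 115.74 GB figure treats the
$0.036/GB pod cost as if it were a monthly storage cost (it is really a
one-time hardware cost, which is part of why the comparison is fuzzy):

    // Break-even storage per customer under the assumption above.
    public class BreakEven {
        public static void main(String[] args) {
            double yearlyRevenue = 50.0;    // USD per customer per year
            double costPerGBMonth = 0.036;  // USD, treated as monthly
            double breakEvenGB = yearlyRevenue / (costPerGBMonth * 12);
            System.out.printf("break-even: %.2f GB%n", breakEvenGB);  // 115.74
        }
    }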

~~~
atYevP
Yev from Backblaze here -> Accurate. We live off the averages. More people
store a little data than store a lot. A lot of folks who have multiple
terabytes backed up tell their friends with less data, and eventually it all
averages out. We're bootstrapped so we have to be cash-flow positive, and at
our $5/month price point we are. We've been at this for 9 years too, and so
far it's worked out OK! No reason to charge a lot for a service if you don't
have to. We might need to move stuff around or change things in the future,
but so far we've never had to raise prices and instead just keep adding
features.

Tell your friends? :)

------
majke
I've never heard about SMR before. I wonder: does it mean that a write to one
area must be done in multiple disk rotations? I.e., will a sequential write to
one file be aggregated into a single location on the disk, or will it be
spread around the whole platter? Just wondering how software utilizing this
should behave.

http://www.storagereview.com/what_is_shingled_magnetic_recording_smr

https://en.wikipedia.org/wiki/Shingled_magnetic_recording

~~~
codemac
Heads up - if you've bought a high density disk recently, it's probably SMR
internally anyways.

A write to one area can be done in a single disc rotation, as long as it's
sequential.

SMR disks added what are called ZBC (Zoned Block Commands) to the SCSI spec.
The disk itself exposes the ability to interact with zones in several ways:

- REPORT ZONES: what zones there are, how many, what type, and the state of
each zone.

- OPEN / CLOSE / FINISH ZONE: zone write state management.

So essentially, it's like a drive exposes ~8 files, opened with O_APPEND, and
you have to manage how you stripe your writes between those files to handle
your workload. Your disk can also have zones that aren't shingled, but those
seem to have been fairly unpopular.
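
As an illustration of that contract (a toy model, not a real ZBC driver), a
host-managed zone behaves like an append-only region with a write pointer:

    // Toy model of host-managed SMR zones: writes go sequentially at
    // each zone's write pointer, and the only "overwrite" is resetting
    // a whole zone. Zones are tiny here for the demo; real ones are
    // typically around 256 MB.
    public class ZonedDevice {
        static class Zone {
            final byte[] data;
            int writePointer = 0;
            Zone(int size) { data = new byte[size]; }

            void append(byte[] buf) {  // the only legal write
                if (writePointer + buf.length > data.length)
                    throw new IllegalStateException("zone full; FINISH it");
                System.arraycopy(buf, 0, data, writePointer, buf.length);
                writePointer += buf.length;  // only ever moves forward
            }

            void reset() { writePointer = 0; }  // erase the whole zone
        }

        public static void main(String[] args) {
            // "like a drive exposing ~8 files opened with O_APPEND"
            Zone[] zones = new Zone[8];
            for (int i = 0; i < zones.length; i++)
                zones[i] = new Zone(1 << 20);
            zones[0].append("sequential writes only".getBytes());
            System.out.println("zone 0 write pointer: " + zones[0].writePointer);
        }
    }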

------
m4dc4pXXX
I also found an older post describing how they use Scrum to design the
hardware to be pretty interesting:
https://www.backblaze.com/blog/designing-the-next-storage-pod

I wonder what limitations the 6-month production run requirement puts on the
design. It would seem you would have to be a lot more conservative in your
"definition of done" (a bad decision can be expensive when you buy a half
year's worth of gear).

------
ksec
Whenever I see a post on the Storage Pod, I always fantasize about how much it
would cost Apple to provide free iCloud backup to its users. With 600M active
users and 30 GB per backup @ $0.036/GB, that is $648M, so even triple
redundancy would cost less than $2 billion.

If Apple can afford to give away free OS X updates and free iWork and iLife,
why not free iOS backup?

Or at least offer a Time Capsule that I can back up my iOS devices to without
going to the cloud.
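
Checking that napkin math:

    // 600M users * 30 GB each at $0.036/GB, then triple redundancy.
    public class ICloudNapkin {
        public static void main(String[] args) {
            double users = 600e6, gbEach = 30, usdPerGB = 0.036;
            double oneCopy = users * gbEach * usdPerGB;   // $648M
            System.out.printf("one copy: $%.0fM, triple: $%.2fB%n",
                    oneCopy / 1e6, 3 * oneCopy / 1e9);    // $1.94B
        }
    }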

~~~
atYevP
Yea. Right?

------
scurvy
Any reason for 60 drives instead of 70 or more? At that depth, there are
several OEM designs with that many drives (70).

No 25/40gig networking?

It's nice that you point out the depth issue, as older racks are shallower
than the newer versions; the longer chassis blocks things like the 0U PDU
channels and cable management channels. That said, the newer, deeper (and
more popular) racks fit everything just fine.

How tall are your racks? How many pods are you getting in there? 3ph 50A
power?

~~~
brianwski
> Any reason on 60 drives instead of 70 or more?

One step at a time. :-) We think we can put a few more drives in the
motherboard half of the pod without many other changes, and connect them to
already existing SATA connectors on the motherboard. To get all the way to 75
drives we will need another row of port multipliers, which then means we need
even more SATA cards, etc.

> No 25/40gig networking?

We're currently not able to saturate the 10 Gbit so for us it is wasted money
to go faster. Remember that we run these in "vaults of 20" so we have 20 pods
EACH with 10 Gbits so the vault hoovers in data at 200 Gbits/sec if you can
thread the application.

> How tall are your racks?

Most of our racks are 45U tall where the PDU takes 4U, so we can fit (though
not necessarily power) 10 pods per rack and still fit a network switch in the
1U left at the top. In the past (with 45 drive pods) we ran 2 circuits of 30A
208V power (three phase). The two circuits are redundant: our power strips can
flip over from one to the other if there is a loss of power on the circuit
currently in use. But I think they only put 8 or 9 of the 45 drive pods in a
cabinet because 10 slightly exceeds what the datacenter likes. And now that
we're moving up to all 60 drive pods, we have to rejigger the whole thing.

It is also worth mentioning that when we bring a 20 pod vault online, we put
each of the 20 pods in different racks in different parts of the datacenter.
This helps keep the vault completely online if any one component fails. For
example, if the power strip feeding the pods in 1 rack power dies, it only
takes 1 pod offline out of several different vaults, so ALL the vaults stay
online both reading and writing data. (I hope that made sense.)

~~~
scurvy
Makes sense. Thanks for the reply!

As to spreading the vaults around the DC, it sounds like that's more to
prepare for ToR switch failure; it sounds like an ATS handles your PDU and
power circuit failover.

Yeah, it sounds like you're mostly CPU bound calculating EC when you rebalance
and replace failed pods, so 25 Gbps Ethernet probably isn't interesting (yet).
That said, it's only marginally more expensive than 10 Gbps twinax.
Interesting that you're running 10GBase-T, particularly from a power and
latency perspective, though I'd imagine you're not worried too much about
latency.

------
bluedino
Any idea what software is used to make the Build Book?
[https://f001.backblaze.com/file/Backblaze_Blog/Storage-
Pod-6...](https://f001.backblaze.com/file/Backblaze_Blog/Storage-
Pod-6/Build+Book+Backblaze+60+Drive.pdf)

~~~
atYevP
Yev from Backblaze here -> You're not going to believe this, but the folks at
Evolve used PowerPoint and a template...seriously.

~~~
bwilliams18
You would be astounded to learn how much use MS Office still gets today.
Excel...that thing runs the world.

------
Already__Taken
Where or how do you find places that make chassis? I'd love to buy 100 or so
completely basic cases built to our design.

------
Scarbutt
Dropbox manages to do 1 PB per server; why doesn't Backblaze?

~~~
venomsnake
Backblaze are/were bootstrapped. They don't have money to burn moving outside
the sweet spot in $/GB, so their logic starts with which drive size is
expected to be cheapest per GB over the next 12 months. I remember some of
their old posts explaining why they made the jump from 2 to 3 TB drives:
"they are slightly more expensive today, but we expect them to go down in
price soon."

~~~
atYevP
This is accurate. Unlike many Silicon Valley companies, we live within our
means :D

------
nwmcsween
Whoa, wait, they don't use ECC RAM?

------
geerlingguy
I accidentally misspelled 'BackBlaze' in the title; can someone fix it please?

Interesting decision by BackBlaze to stuff an extra row of drives in,
lengthening the chassis to 33" instead of just under 29". It is _their_
design, so they're free to do what they need for their own racks, but this
does mean anyone trying to install a pod based on the 6.0 design couldn't use
many closed-back racks or other kinds of racks.

~~~
cthulhujr
Are they offering the 6.0 controllers, etc. with the standard length housing?
I would think that's an option soon, if not now. Perhaps the 6.0 controllers
have no benefit with fewer drives.

~~~
brianwski
Brian from Backblaze here. Just to be absolutely clear, you cannot "buy" one
of these from Backblaze. We don't sell them. You are free to take these
instructions and assemble one yourself though! And so yes, you could simply
drop the 6.0 controllers into an older chassis and it will work fine.

Also, the 6.0 housing will fit in most racks. If somebody can either measure
their racks or include a link to a rack where the 33" chassis will no longer
fit, we really want to hear about it!

~~~
toomuchtodo
Protocase, the vendor Backblaze sourced to fabricate the enclosures, has their
own subsidiary that'll sell you turnkey pods:

http://www.45drives.com/

~~~
duskwuff
Surely they should be rebranding to 60drives now? ;)

