
Petabytes on a budget: How to build cheap cloud storage (2010) - Oculus
https://www.backblaze.com/blog/petabytes-on-a-budget-how-to-build-cheap-cloud-storage-2/
======
KaiserPro
I look after about 15PB of tier-1 storage, and I'd recommend not doing it the
Backblaze way.

It's grand that it's worked out for them, but there are a few big drawbacks
that Backblaze have software'd their way around.

Nowadays it's cheaper to use an Engenio-based array like the MD3260 (also
sold/made by NetApp/LSI).

First, you can hot-swap the disks. Second, you don't need to engineer your own
storage manager. Third, you get much, much better performance (2 gigabytes a
second sustained, which is enough to saturate two 10-gig NICs). Fourth, you
can get 4-hour 24/7 support response. Finally, the airflow in the Backblaze
pods is a bit suspect.

We use a 1U two-socket server with SAS to serve the data.

If you're brave, you can skip the RAID controller, use a JBOD enclosure
instead, and run ZFS over the top. However, ZFS fragments like a bitch, so
watch out if you're running at 75%+ full.

~~~
brianwski
Disclaimer: Backblaze employee here.

> Nowadays its cheaper to use an engenio based array like the MD3260

Cheaper than what? Backblaze still builds its own storage pods because we
can't find anything cheaper. We would LOVE getting out of that part of our
business, just show up with something 1 penny cheaper than our current pods
and we'll buy from you ENDLESSLY. NOTE: we have evaluated two solutions from
3rd parties in the last 3 months that were promising, but when the FINAL
pricing came in (not just some vague promise, wait for the final
contract/invoice) it was just too much money, we can build pods cheaper.

If your goal is to keep your job as an IT person - spend 10 or 100 times as
much money on super highly reliable NetApp or EMC. You won't regret it. Your
company might regret the cost, but _YOUR_ job will be a bit easier. But if you
want to save money - build your own pods. NOTE: remember Backblaze makes ZERO
dollars when you build your own pods.

~~~
qthrul
Disclaimer: VCE employee here. VCE is an EMC Federation Company [tm] with
investments from Cisco and VMware.

> We would LOVE getting out of that part of our business [...]

That's a common theme when I meet with some unique companies that are still
building out their own stacks of kit and just enough software to keep their
consumption patterns consistent. Some years it's about necessity, but often it
comes down to assembly-line methods and sunk investments.

I'm curious how much variance you guys have / tolerate, and whether you ever
see yourselves reaching VMI (maybe you are already there?) procurement levels.

Market wise, it isn't too outlandish to expect many of the "super highly
reliable" options to have a future that is a spectrum of DIY + software (100%
roll your own) all the way up to a full experience (status quo). Today, that
roll your own option is a progression of hardware compatibility lists to
reference architectures but there are only a few companies that go beyond
that.

I'm curious to hear how much you see as "too much money" to keep your team
from getting out of the rack/stack/burn/test/QA/QC/deploy. Do you plow the
"too much money" less "build pods cheaper" into a value that can be compared
against the opportunity cost of the labor involved?

~~~
brianwski
> I'm curious to hear how much you see as "too much money" to keep your team
> from getting out of the rack/stack/burn/test/QA/QC/deploy.

Early on the "too much money" would have been maybe a 20 percent premium. But
now that we have scaled up and have all the teams in place, "too much money"
is probably even a 1 percent premium. In other words, if we moved to a
commercial solution now it would be because we actually SAVE money over using
our own pods. That shouldn't be impossible - a commercial operation could make
their profit margin out of a variety of savings, like maybe their drives fail
less often than ours do because they do better vibration dampening. Or since
this is "end-to-end cost of ownership" if the commercial system could save
half of our electricity bill by being lower power then they can keep 100
percent of that.

We have a simple spreadsheet that calculates the cost of space rental for
datacenter cabinets, the electricity pods use, the networking they use, the
cost of all the switches, etc. It isn't overwhelmingly complex or anything. It
includes the cost of salaries of datacenter techs with a formula that says
stuff like "for every 400 pods we hire one more datacenter tech", stuff like
that. Hopefully if we went to a commercial built storage system we could
increase the ratio like only hire one more datacenter tech every 800 pods. The
storage vendor could pocket the salary of that employee we did not hire.
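The kind of spreadsheet formula described above can be sketched in a few lines (every dollar figure below is an illustrative placeholder, not a real Backblaze number; only the 400-pods-per-tech ratio comes from the comment):

```python
import math

def monthly_cost(num_pods,
                 rent_per_pod=200.0,    # assumed cabinet rent share, $/pod/month
                 power_per_pod=150.0,   # assumed electricity, $/pod/month
                 network_per_pod=50.0,  # assumed switches/bandwidth, $/pod/month
                 tech_salary=6000.0,    # assumed datacenter tech salary, $/month
                 pods_per_tech=400):    # "for every 400 pods we hire one more tech"
    """Total monthly operating cost for a fleet of storage pods."""
    techs = math.ceil(num_pods / pods_per_tech)
    per_pod = rent_per_pod + power_per_pod + network_per_pod
    return num_pods * per_pod + techs * tech_salary

# Doubling the pods-per-tech ratio lets a vendor "pocket the salary
# of that employee we did not hire":
savings = monthly_cost(800, pods_per_tech=400) - monthly_cost(800, pods_per_tech=800)
print(savings)  # one tech's salary: 6000.0
```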

Or heck, maybe a large vendor can get a better price on hard drives than we
get. We buy retail.

> reaching VMI (maybe you are already there?) procurement levels?

I'm not completely sure I understand the question, but here is one answer: in
a modern web and email and connected world with tons of competition, I'm not
sure there are any economies of scale past the first 10 computers you build.
Places will laser cut and fold sheet metal in units of "1" for extremely
reasonable prices, it's AMAZING what the small manufacturers can do for you
nowadays.

------
2close4comfort
[https://www.backblaze.com/blog/backblaze-storage-pod-4/](https://www.backblaze.com/blog/backblaze-storage-pod-4/)
here is the latest version. I have loved being able to follow the iterations
of the storage pod. They are very thoughtful about HW choices but leave it
open, which I think is the best part!

~~~
atYevP
Yev from Backblaze here -> Glad you're enjoying the posts! We enjoy spreading
that information; it's a rare thing, so we're pleased as punch that folks are
reading them :)

~~~
cc439
It's more than rare; you seem to be the only real-world, large-scale source on
a lot of the topics you've been covering.

I may not work in IT but I crave well-written accounts of technical problem
solving. Yours are among the most interesting as you're so open about sharing
details that no one else would consider releasing.

~~~
brianwski
Brian from Backblaze here.

Awww, you are so kind you make us blush. :-)

I do wonder why big companies like Google and Yahoo and Apple don't release
more interesting statistics on drive failure. I've heard many people assume
they are getting some sweet pricing on hard drives with an agreement never to
release those numbers. But if that were true, why won't Seagate or Western
Digital or HGST ever give Backblaze a deal to shut us up? Those three
companies have refused to sell to Backblaze "directly" (their business plan is
to sell to distributors like Ingram who mark up the product and sell it to
NewEgg who marks up the product and sells to Backblaze). I assure you, there
are no backroom undisclosed pricing deals for Backblaze.

So I suspect Google and Apple pay retail drive prices also, and are not
disclosing drive failure rates because they don't see the profit in it.

~~~
atlbeer
2007:
[http://research.google.com/pubs/pub32774.html](http://research.google.com/pubs/pub32774.html)

~~~
atYevP
But no names named :(

------
jpalomaki
I've always been very suspicious of backup vendors offering unlimited space
for a fixed price. These storage pod posts by Backblaze were the primary
reason I decided to give their service a try. Knowing the technology behind
the system made it much more credible to me.

~~~
brianwski
Brian from Backblaze here.

This is EXACTLY one of the reasons we released the information. People would
write about Backblaze and say we were losing money and it was an unsustainable
business funded by deep pockets. That was frustrating to hear because we self-
funded the company for years (no VCs) and I assure you there are no deep
pockets here.

So we released the info on the pods partly so we could point at it and say
"see, the math really DOES work!"

~~~
harel
That: no VCs. That is a sign of a healthy company if you can do it without
someone else's money (read: selling your soul for cash). I often see people
with an idea worrying more about getting VC money than executing the idea and
I can never understand it. If your idea is solid you don't need external
money. There are valid reasons to chase it (facilitating rapid growth for
example) but I rarely see VC money sought after for the right reasons.

------
andyidsinga
> A Backblaze Storage Pod is a Building Block

> But the intelligence of where to store data and how to encrypt it,
> deduplicate it, and index it is all at a higher level (outside the scope of
> this blog post).

I'm curious about their software that works outside the nodes too. I've been
working on storage clusters over the past 9 months using the Ceph
( [http://ceph.com/](http://ceph.com/) ) open source storage software. It's
pretty amazing -- and I suspect it could be deployed to a set of Backblaze
pods too.

It seems that for a production environment where you want to maintain
availability, you would need to build at least 3 of those pods for any
deployment, enabling replication across pods/storage nodes.
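As a toy illustration of the replication idea (this is not Ceph's actual CRUSH placement, just a minimal rendezvous-hashing sketch; all names are made up):

```python
import hashlib

def place_replicas(object_name, pods, copies=3):
    """Pick `copies` distinct pods for an object, deterministically.

    Rank pods by a hash of (pod, object) and take the top ones --
    a toy version of rendezvous (highest-random-weight) hashing.
    """
    ranked = sorted(
        pods,
        key=lambda pod: hashlib.sha256(f"{pod}:{object_name}".encode()).hexdigest(),
    )
    return ranked[:copies]

pods = ["pod-a", "pod-b", "pod-c", "pod-d", "pod-e"]
replicas = place_replicas("customer-backup-0042", pods)
print(replicas)  # 3 distinct pods; the same object always maps to the same 3
```

With 3 copies on 3 different pods, any single pod can be taken down (or die) while two live replicas remain.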

~~~
xorcist
Care to write something about your experiences with Ceph? I was recently
looking at something that could be a use case for it, but it is a bit too
close to the holy grail of storage for me to feel comfortable. I've toyed with
similar systems in the past and they've always been very iffy with their POSIX
semantics.

~~~
andyidsinga
We're not doing anything explicitly POSIX with Ceph - not using the Ceph
filesystem. We're using Ceph for block storage under virtual machines that use
KVM/QEMU with libvirt etc. The block devices exposed to the VMs end up having
an ext4 filesystem on them, but that has nothing to do with Ceph.

So far it seems fantastic from a VM perspective -- I can take down storage
nodes, add new nodes etc., all while the VMs running up top stay alive and
well. I once lost 1/3 of my cluster (on a 3-node cluster with 30 disks) ...and
everything stayed running until I fixed the broken node.

Ceph also supports object storage and an S3 interface -- I haven't
experimented much with that yet, but I hear it's good.

~~~
xorcist
Thanks, that sounds like a very interesting use case. It sounds almost perfect
for KVM storage. How do you expose the file system to the KVM nodes? Just
mount it?

What's performance like so far?

~~~
andyidsinga
QEMU has support for Ceph's "rbd" (RADOS block device). You create an image of
a particular size using the Ceph tools, then when you create a VM XML
definition you specify an rbd disk. (See here:
[http://libvirt.org/formatdomain.html#elementsDisks](http://libvirt.org/formatdomain.html#elementsDisks)
)

Finally, when the VM is booted you can use that block device like any other
machine. For a first install it will be empty (you need to attach a cdrom
image to get going), then you can start making snapshots for "thin
provisioning" using copy-on-write clones ...very sweet stuff :)

Re: performance ...I think it's good, but I have heard it's not as good as
expensive proprietary systems. The product we're building is around storage
analytics on open source ...so eventually we'll know more :)

------
codeonfire
What do you do when a drive goes bad? Do you move ~50TB of data, pull the
entire pod out of the rack, and then try to determine which of 45 drives is
bad?

~~~
rythie
Why would you need to do any of that? RAID6 can tolerate 2 drive failures, and
Linux will tell you which drive is bad. Just slide the pod out and replace the
drive: no data lost, very little downtime.
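For intuition on why two-drive tolerance is usually enough, here's a back-of-the-envelope binomial estimate (the 4% annual failure rate and one-week rebuild window are illustrative assumptions, and drive failures are assumed independent):

```python
from math import comb

def p_array_loss(drives, afr=0.04, window_days=7.0, tolerate=2):
    """Probability that more than `tolerate` drives in one RAID group
    fail within a single rebuild window (independent failures assumed)."""
    p = afr * window_days / 365.0  # per-drive failure chance in the window
    return sum(comb(drives, k) * p**k * (1 - p)**(drives - k)
               for k in range(tolerate + 1, drives + 1))

# A 15-drive RAID6 group: chance of 3+ overlapping failures in one week
print(f"{p_array_loss(15):.2e}")  # on the order of 1e-7
```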

~~~
codeonfire
Three drive failures? My question is how do you practically determine which
drive to swap. I don't see any labels or anything. Also I read the current
version supports rails. The one in the article looks bolted to the rack. The
article had no date on it and it sounds like a lot of the issues have been
addressed.

~~~
brianwski
Brian from Backblaze here.

> how do you determine which drive to swap

Every two minutes we query every drive (all 45 drives inside the pod) with the
Linux smartctl command. We send this information to a kind of monitoring
server, so even if the pod entirely dies we know everything about the health
of the disks inside it up until 2 minutes earlier. We keep the history for a
few weeks (the amount of data is tiny by our standards).

Then, when one of the drives stops responding, several things occur: 1) we put
the entire pod into a "read only" mode where no more customer data is written
(this lowers the risk of more failures), 2) we have a friendly web interface
that informs the datacenter techs which drive to replace, and 3) an email
alert is sent out to whoever is on call.

Each drive has a name like /dev/sda1, and these names map reproducibly to the
same physical location in the pod every time. In addition, the drive also
reported (before it disappeared) a serial number like "WD-WCAU45178029" as
part of the smartctl output, which is ALSO PRINTED ON THE OUTSIDE OF THE
DRIVE.

TL;DR - it's easy to swap the correct drive based on external serial numbers.
:-)

~~~
codeonfire
Ok, thanks for the info. That doesn't sound too bad.

------
jdub
But in 2014, 1PB in S3 – with 11 nines of data durability – costs ~USD$30,000.

~~~
epistasis
That's $30,000 per month. These pods will last 3-5 years, which is $1-$1.8M of
S3 storage.

Say you store 3-4 copies of the data on the BackBlaze pods in order to attempt
comparable durability, and then the cost comparison comes down to power,
space, network, and manpower costs. These costs are _extremely_ variable, so
depending on access to these, I could see someone still going with S3. S3 also
scales larger and smaller quite quickly, unlike investing in your own
equipment.

~~~
jdub
Yeah, so, over three years in us-east-1:

    
    
      S3 = ~USD $1,085,552.64
    

Or more realistically:

    
    
      Glacier = ~USD $377,487.36
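The Glacier figure falls out exactly from 2014's $0.01/GB-month price applied to 1 PB counted as 1,048,576 GB over 36 months:

```python
PIB_IN_GB = 1024**2            # 1 PB counted binary-style: 1,048,576 GB
GLACIER_PER_GB_MONTH = 0.01    # 2014 us-east-1 Glacier price, $/GB-month
MONTHS = 36                    # three years

total = PIB_IN_GB * GLACIER_PER_GB_MONTH * MONTHS
print(f"${total:,.2f}")  # $377,487.36
```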

~~~
epistasis
I wish Glacier was a realistic option for me, but I need to look at my data
again.

------
bkruse
This has always interested me. I need to do decently big storage for genomic
data. It doesn't have to be fast, but it needs to be able to survive one data
center blowing up. If I have 3 data centers, need to store 2-3 petabytes and
need storage to survive in the case of a data center failure - the solutions
really narrow down when you have to get under the $200/TB range.

Playing with Swift now - but it has really opened my eyes to how much more
difficult 2-3 petabytes of storage is (disk failures, the number of disks in
your infrastructure, the time to redeploy a datacenter on a 1 Gbps
connection). All the little problems become much bigger!
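The "time to redeploy a datacenter on a 1 Gbps connection" concern is easy to quantify; even assuming a perfectly saturated link with no protocol overhead, the numbers are sobering:

```python
def transfer_days(bytes_total, link_gbps=1.0):
    """Days to move `bytes_total` over a link, assuming 100% utilization."""
    seconds = bytes_total * 8 / (link_gbps * 1e9)
    return seconds / 86400

two_pb = 2e15  # 2 PB, decimal
print(f"{transfer_days(two_pb):.0f} days")  # ~185 days at a saturated 1 Gbps
```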

~~~
ddorian43
Don't know if swift has erasure coding yet, but QFS[0] has it.

[0]: [http://quantcast.github.io/qfs/](http://quantcast.github.io/qfs/)

~~~
thrownaway2424
Single metadata node still? Seems like a good way to manage a large amount of
data, and then lose it all.

------
harel
Tech aside, I'm quite curious about the economics of storage here. From the
price tags and 'Debian 4' I'm guessing this is an older post. But still,
$7,867 per 67TB at $5 per month means they need 131 users paying for one year
to recoup the cost of one pod, assuming those 131 do not generate over 67TB
worth of storage in that period of time. I've not factored in data centre
costs, salaries etc., just a pod. I'm guessing they have enough users as they
have been around for many years now, but still, $5 seems a bit on the cheap
side to me (not that I'm complaining).
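The break-even arithmetic in the comment above checks out:

```python
pod_cost = 7867       # one 67TB pod, from the post
price_per_month = 5   # per-user subscription

users_needed = pod_cost / (price_per_month * 12)
print(round(users_needed))  # ~131 users paying for a year covers one pod
```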

~~~
brianwski
Brian from Backblaze here.

> I'm quite curious about the economics of storage here.

When we started almost 8 years ago, we were CLUELESS about the economics of it
all, and scared witless over it. :-) But I'm happy to report the grand
experiment has been a modest success, we are profitable (by a small amount).
We even have a little money left over every month to spend on marketing
efforts now, and we're growing by every metric - more customers each year,
more revenue each year, more employees to swap failed drives (38,790 spinning
drives as of yesterday - some fail like clockwork each and every single day).

~~~
coned88
any change you will ever offer backup without the need for a client. Sometime
a user could just use sftp or something to send the files to. Something more
open so I don't have to install your software. That's what holds me back.

------
corv
I wonder how they can guarantee data integrity.

Are they checksumming at a higher level, and is that cheaper than using ZFS
with necessarily more expensive hardware?

~~~
epistasis
As of 2012, they checksum each file and check it every few weeks, and if a
file goes bad they have the backup client retransmit it:

[https://help.backblaze.com/entries/20926247-Security-
Questio...](https://help.backblaze.com/entries/20926247-Security-Question-
Round-up-)

~~~
brianwski
Brian from Backblaze here.

Yep, we checksum the file on the client before sending it. In practice, for
any one user the number of "silent bit flips" on disk is very low and you
probably wouldn't notice them (maybe 1 note in your mp3 file is slightly off).
But when we check the checksums every few weeks, at our scale we absolutely
see it occur in our datacenter, probably every day, and we take steps to heal
the files (either from other copies we have, or by asking the client running
on the customer's laptop to retransmit the file).
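The scrub described above amounts to: store a checksum at upload time, re-hash on a schedule, and repair on mismatch. A minimal sketch (SHA-1 is just the example hash here; the comment doesn't say which one Backblaze actually uses):

```python
import hashlib

def checksum(data: bytes) -> str:
    return hashlib.sha1(data).hexdigest()

def scrub(stored_blocks, known_checksums, retransmit):
    """Re-hash every stored block; heal any that no longer match."""
    healed = []
    for name, data in stored_blocks.items():
        if checksum(data) != known_checksums[name]:
            stored_blocks[name] = retransmit(name)  # fetch a good copy
            healed.append(name)
    return healed

# Simulate one silent bit flip on disk and heal it from the "client"
original = {"song.mp3": b"intact audio bytes"}
checksums = {name: checksum(data) for name, data in original.items()}
stored = {"song.mp3": b"intact audio byteX"}  # one corrupted byte

healed = scrub(stored, checksums, retransmit=lambda name: original[name])
print(healed, stored["song.mp3"] == original["song.mp3"])  # ['song.mp3'] True
```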

------
immortalx
I decided to try it. You can only select entire hard drives and work around
this by excluding folders. That's odd, but what I don't understand is why you
cannot exclude your C:\ (or main drive). Why should anyone be forced to back
up something?

Seems to me like the design is backwards and doesn't make any sense.

~~~
brianwski
Brian from Backblaze here.

The focus of our backup system is being "easy to use". We get a lot of
advanced users who are computer savvy, but we also get some mom and pop novice
who aren't very good with computers.

Ok, so we don't allow you to unselect your C:\ (main) drive. Here is the
reason: when given that choice an alarmingly high number of novices
accidentally unselected the drive. (sigh)

We don't charge customers any more or less based on what they back up, so
there are almost no legitimate reasons to UNSELECT a drive. What data do you
have that you hate so much you MUST lose it when your computer is stolen? So
we err on the side of helping these novice users and don't allow them to
unselect their main laptop drive. The downside is it sometimes irritates the
advanced customers who know what they are doing. :-)

------
ciupicri
If you submit an old article, even if newer versions of it exist, at least
mention the year in parentheses.

------
hendzen
Honest question, why JFS vs say, ext4?

~~~
Oculus
In later blog posts they discuss moving over to ext4 despite the 16TB size
limit.

~~~
brianwski
Brian from Backblaze here.

Correct, we are entirely on ext4 now (there might be one or two old pods
running JFS left, I would need to double check).

We figured out a way to work around the 16 TByte volume limit originally, but
now ext4 even supports larger volumes so that isn't a problem at all anymore.
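For context, the original 16 TByte ceiling follows directly from ext4's 32-bit block numbers at the default 4 KiB block size:

```python
block_size = 4096          # default ext4 block size, bytes
block_numbers = 2**32      # 32-bit block addresses

max_volume_bytes = block_size * block_numbers
print(max_volume_bytes // 2**40, "TiB")  # 16 TiB
```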

More random history: At the time we chose JFS, it looked like XFS might lose
support (XFS was created by a now-defunct company I used to work for called
SGI). Well, the world is a strange place, and XFS now has more support than
JFS despite SGI going out of business. But in the end, ext4 is the most
commonly used Linux file system and therefore arguably the best supported,
with extra tools and such, and the performance of ext4 in many areas seems
higher than JFS (at least for our use case), so the decision to go with ext4
becomes easier and easier.

------
fredsted
>In the future, we will dedicate an entire blog post to vibration.

In the meantime, does anyone have a link?

------
aliakhtar
Do they only do backups, or also cloud storage? If they have an API for
uploading / deleting / viewing files, I'd use them over S3 given how much
lower their costs are. But I can't find any info on that on their website.

~~~
eeZi
No API. Only backups, and you need their client.

~~~
brianwski
Brian from Backblaze here.

To expand on this, we require our client because of our pricing model, which
is $5 per computer for unlimited storage FOR THAT ONE laptop or desktop. An
API would mess with that pricing model, so we haven't opened it up yet.

------
mschuster91
I'd love it if either Backblaze or a 3rd party made a business of selling
these pods!

edit: just spotted it, their boot drive is PATA?! Why is this, given that PATA
drives are slower and more expensive than SATA ones?

~~~
spartas
45Drives [http://45drives.com](http://45drives.com) is one company that's
selling pre-built versions of the storage pod.

~~~
mschuster91
[http://www.45drives.com/store/order-beerinator.php](http://www.45drives.com/store/order-beerinator.php)

A 4U beer cooler, not sure if awesome or stupid... but still, a nice idea.

~~~
atYevP
Yev from Backblaze here -> We are about to have one on order...maybe a whole
cabinet full ;-)

------
ksec
Need to add the year to the title. This post is old.

------
tkinom
Great article!

Would love to see more write-ups on the software selection process, tradeoffs,
failure recovery processes/methods, and benchmark data.

------
sidcool
Quite detailed post. Loved reading it.

------
iflyun
Why such an old and expensive CPU?

~~~
tjl
It's an old post. See the latest post,

[https://www.backblaze.com/blog/backblaze-storage-
pod-4/](https://www.backblaze.com/blog/backblaze-storage-pod-4/)

for more details on the new one.

