
Backblaze Vaults: Zettabyte-Scale Cloud Storage Architecture - nuriaion
https://www.backblaze.com/blog/vault-cloud-storage-architecture/
======
tiemand
Reading the comments, is anyone else bothered by this reply from a Backblaze
representative:

"Right now, Backblaze has only one datacenter, so the short answer is "no".
:-)

The longer answer is that for online backup, there is one copy of your data on
your laptop, and another copy in the Backblaze datacenter in Sacramento. If a
meteor hits our datacenter in Sacramento pulverizing it into atoms, you STILL
would not lose one single file, not one - because your laptop is still running
just fine where ever you are with your copy of the data. In the case that
occurs, we will alert our users they should make another backup of their
data."

There are a million and one things other than a meteor strike that can go
wrong in a data center. I would not trust a backup provider that does not
replicate my data across at least two data centers.

~~~
atYevP
Yev from Backblaze here -> It's true, we just have the one. Backblaze was
bootstrapped, so we can't over-expand and still maintain the profitability
that keeps us in business. We're pretty up-front about having the one
datacenter. We'd LOVE to add more in the future, but truthfully, it would at
least double our costs, and we'd need to raise prices. We're thinking of ways
to avoid that while maintaining our current business model. If you are
looking for something more geo-redundant, take a look at services like Amazon
S3, they are great, but the downside there is they charge per GB to make up
for the extra costs, so depending on the amount of data, it can get pricey.
Either way, we recommend a 3-2-1 backup policy (3 copies of your data, 2
onsite but on different mediums and 1 offsite) as a good start to a backup
strat. We're just one solution of many, though, we like to think we're the
easiest one!

~~~
kbenson
How about an add-on cost/service that tags your data as needing datacenter
redundancy, and only replicates that data to a new datacenter? It has the
benefit of not requiring as much up-front investment, it pays for itself as
it's used, and you have a bunch of current customers you can upsell to. The
architecture to segregate redundant from non-redundant backup customers could
be a pain, but as long as you have tools to migrate data between systems (I
imagine you do), it could just be two separate Backblaze clusters in the
first datacenter, one which supports redundancy and one which doesn't; you
then migrate customer data between the clusters as they add or drop the
redundancy service. That saves you from having to cherry-pick specific
files/customers from the cluster to duplicate in the other datacenter; you
just make sure one cluster is always redundant.

~~~
atYevP
We're definitely looking at options like this, but the engineering work it
would take to implement solutions like that is not insignificant, and a lot
of our engineering muscle has been going toward rolling out the Vaults over
the past year and change! It could certainly be another revenue stream for
us, but building out a new datacenter is expensive, especially if you don't
buy/guarantee build-out ahead of time, so we'd have to forecast how many
people would want that service and prepare accordingly. Again, not
insignificant stuff, but it's definitely possible in the future!

~~~
kbenson
The nice thing about using separate clusters is that you can build them out in
chunks. Build X new capacity in your main datacenter as a new cluster, and X
new capacity in a different datacenter, and replicate. Need more redundant
capacity? Build Y new capacity in your main datacenter, and Y new capacity in
a different datacenter, not even necessarily the same backup datacenter as
before. You end up with one main non-redundant cluster, and a bunch of smaller
redundant clusters spread over one or more additional datacenters.

If you're _really_ lucky, you siphon off customers from the non-redundant
service at the same rate as (or faster than) new customers sign up for it,
which lets you avoid building out the non-redundant side much for a short
while.

------
AceJohnny2
"For Backblaze Vaults, we threw out the Linux RAID software we had been using
and wrote a Reed-Solomon implementation from scratch. It was exciting to be
able to use our group theory and matrix algebra from college. We’ll be talking
more about this in an upcoming blog post."

I hope I'm not the only one uncomfortable about this. I mean, I understand the
need for greater flexibility and features that MDRaid doesn't provide, but
this wording stinks of NIH and reinventing the wheel "because it was fun",
discarding the maturity and reliability of established software. And data
storage is _all_ about reliability.

I hope I'm just reading too much from this, and it isn't actually
representative of Backblaze's engineering practices.

~~~
brianwski
Brian Wilson from Backblaze here (not the "Brian Beach" who is author of the
blog post) - this was not a case of NIH. We use lots and lots and lots of
existing software like Debian, Java, Ext4. We use tools like ansible and
Zabbix. But this one thing just didn't exist for us in the form we needed it.
We looked, we really did.

We did write the Reed-Solomon ourselves in a "clean room" so we did not have
to pay any licensing fees and we clearly didn't steal anybody else's source
code, but that is a very small amount of code. Like 80 lines of Java.
Seriously. We referenced the technical papers we read to implement it in that
blog post, but here it is again:
[http://www.cs.cmu.edu/~guyb/realworld/reedsolomon/reed_solom...](http://www.cs.cmu.edu/~guyb/realworld/reedsolomon/reed_solomon_codes.html)
And we unit tested the living heck out of that code, plus we mathematically
verified various parts.
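
For the curious, the core of such a scheme really is small. Here is a toy
sketch of systematic Reed-Solomon erasure coding over GF(2^8) in Python,
using a Cauchy matrix for the parity rows - one standard construction, not
necessarily the one Backblaze's Java code uses:

```python
# k data shards plus m parity shards; any k surviving shards recover the
# originals. Shard counts and the Cauchy construction here are illustrative.

PRIM = 0x11D                      # primitive polynomial for GF(2^8)
EXP, LOG = [0] * 512, [0] * 256
x = 1
for i in range(255):
    EXP[i], LOG[x] = x, i
    x <<= 1
    if x & 0x100:
        x ^= PRIM
for i in range(255, 512):         # doubled table avoids a mod 255
    EXP[i] = EXP[i - 255]

def gmul(a, b):
    return 0 if a == 0 or b == 0 else EXP[LOG[a] + LOG[b]]

def ginv(a):
    return EXP[255 - LOG[a]]

def parity_row(j, k, m):
    # Cauchy row j: 1/(x_j + y_i) with x_j = j, y_i = m + i (addition is XOR)
    return [ginv(j ^ (m + i)) for i in range(k)]

def _dot(row, col):
    acc = 0
    for c, d in zip(row, col):
        acc ^= gmul(c, d)
    return acc

def encode(data, m):
    # data: k equal-length byte lists -> m parity shards
    k = len(data)
    return [[_dot(parity_row(j, k, m), col) for col in zip(*data)]
            for j in range(m)]

def reconstruct(shards, k, m):
    # shards: k + m entries, None where lost; any k survivors suffice
    G = [[int(r == c) for c in range(k)] for r in range(k)]
    G += [parity_row(j, k, m) for j in range(m)]
    live = [r for r in range(k + m) if shards[r] is not None][:k]
    A = [G[r][:] for r in live]
    B = [list(shards[r]) for r in live]
    for col in range(k):          # Gauss-Jordan elimination over GF(2^8)
        piv = next(r for r in range(col, k) if A[r][col])
        A[col], A[piv] = A[piv], A[col]
        B[col], B[piv] = B[piv], B[col]
        inv = ginv(A[col][col])
        A[col] = [gmul(inv, v) for v in A[col]]
        B[col] = [gmul(inv, v) for v in B[col]]
        for r in range(k):
            if r != col and A[r][col]:
                f = A[r][col]
                A[r] = [a ^ gmul(f, b) for a, b in zip(A[r], A[col])]
                B[r] = [a ^ gmul(f, b) for a, b in zip(B[r], B[col])]
    return B                      # the k recovered data shards
```

With the Vaults' 17+3 layout, k=17 and m=3: lose any three of the twenty
shards and the stripe still reconstructs.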

But I'm open to an alternative solution if you can suggest one? Remember our
three highest priorities are: 1) reliable, 2) low cost, 3) simple. The "low
cost" includes things like we do not want to pay ongoing licensing fees to
other companies.

~~~
AceJohnny2
Thanks, that's basically what I wanted to hear. :)

My first thought was that you could've reused the R-S code from mdraid,
dm-raid, or ZFS, but on second thought 1) it may be too specialized to be
reusable, and 2) it's GPL (or CDDL), so you can't just plonk it into your own
code.

And yeah, if it's just 80 lines of Java, I'm worrying about the wrong things.

~~~
fragmede
> 2) it's GPL

Backblaze is a web service, so for better or worse, the GPL doesn't apply
here, since we never have access to their binaries. The AGPL would apply, but
that's not the license used.

------
ChuckMcM
That was a great read. I was surprised that when a drive fails, its data is
rebuilt onto the replacement drive in the same slot. One of the things that
GFS did (I presume they still do) and Blekko does is that when a drive fails,
its data is reconstructed on other working drives, so replacement leaves no
long-term degradation risk. If you don't do that, then while your drive is
dead you have lost some data resiliency until it gets replaced, as opposed to
only until the data has been successfully recreated elsewhere[1].

It's no wonder storage companies like EMC are hurting when you have
innovators like these guys out there.

[1] Which, given a 3x replication system and a sharded (or chunked) file, can
happen pretty quickly.

~~~
brianwski
Brian from Backblaze here. We do angst over the idea of a "hot spare" where
the very second we fail a drive it can begin rebuilding elsewhere. But that
takes up redundancy even when it is not used (an extra drive waiting) which
raises cost.

At our current scale it is becoming less and less of an open debate, because
we now have 7-day-a-week staffing at our datacenter, and the datacenter techs
jump right in and replace failed drives, often within an hour or so. A "hot
spare" would only save a couple hours of rebuild time. But remember, your
mileage will vary - until you reach half our scale you cannot afford even a
Monday-Friday datacenter tech, so you might only be able to replace failed
drives on Mondays and Wednesdays, which widens your exposure.

~~~
tacticus
Have you considered rebuilding into already available space in the cluster?

Something similar to how Ceph or Swift handles rebuilds? You get rid of the
individual disk sitting around as a spare, though it would break the idea of
a tome being a specific collection of disks. You would need to be able to
identify and move a shard around your cluster into other vaults, and a shard
would need to be smaller than the raw disk size.

This would increase network overhead as well (more movement).

I'm probably just rambling here so you can probably ignore me. (you have
awesome tech there though)

~~~
brianwski
> I'm probably just rambling here

:-) Not at all! Don't assume we're some perfect team of scientists that know
all the correct solutions before we start coding. We often angst over these
decisions and designs, knowing that once we write the code a lot will be set
in stone (hard to change) for a number of years. The reason it becomes hard
to change is we don't have a huge development team that can afford to rewrite
the software every year, so we try to get it correct and then go on to work on
new things or polishing up corners that need polishing.

------
sp332
I'm really looking forward to the Reed-Solomon article. It seems that very few
RAID-like applications are built to handle arbitrary data and parity stripes.

~~~
fatratchet
Here's a nice explanation of how ZFS does triple-parity RAID with
Reed-Solomon:

[http://people.freebsd.org/~gibbs/zfs_doxygenation/html/da/dc...](http://people.freebsd.org/~gibbs/zfs_doxygenation/html/da/dc9/RaidZ.html)

[http://people.freebsd.org/~gibbs/zfs_doxygenation/html/d1/d7...](http://people.freebsd.org/~gibbs/zfs_doxygenation/html/d1/d7d/vdev__raidz_8c.html#_details)

~~~
sp332
_Note that the Plank paper claimed to support arbitrary N+M, but was then
amended six years later identifying a critical flaw that invalidates its
claims. Nevertheless, the technique can be adapted to work for up to triple
parity._

So better than most, but still not arbitrary. And the limiting factor seems to
be write performance.

------
jewel
Getting rid of RAID makes things a lot easier, since you don't have to suffer
through rebuilds, which cause a lot of I/O for the entire RAID. You still
have to repopulate the drive, but you have fine-grained control over when to
do it and even which files have the highest priority.

For those looking to build something similar, check out Ceph or Gluster.

Is a single file spread across multiple data centers? At the claimed 99.99999%
annual durability, doesn't the chance of a natural disaster that could take
out the entire data center start being a major factor?

I realize that the customer also has a copy of the data so you don't have to
take the same precautions as something like S3, but it'd be sad if a
datacenter got taken out by a meteor or airplane crash the same day that the
customer's laptop was stolen.

Finally, a question for backblaze devs. In your opinion, how often do you need
to scrub a drive to check for problems?

~~~
atYevP
Yev from Backblaze -> That meteor question comes up a lot
([https://www.backblaze.com/blog/vault-cloud-storage-
architect...](https://www.backblaze.com/blog/vault-cloud-storage-
architecture/#comment-1901229080)). We currently do have one data center, but
this design allows us to bring others online. If the datacenter was hit by a
meteor all our customers would get an email blast urging them to create a
local backup immediately. The chances that both the DC and the user would get
hit by the same natural disaster are relatively small. Still, it's not a
storage service like S3 so geo-redundancy plays a smaller role. We do plan on
building out other datacenters in the future, but since we're bootstrapped, we
have to do that when the time is right, otherwise it would be very easy to
over-extend ourselves and start losing money.

 _edit_ -> I ignored your Backblaze dev question, sorry. We have multiple
processes running at all times on the pods, and they go shard-by-shard. We're
always optimizing, but the short answer is, we're always looking for errors.

~~~
jacquesm
It doesn't take a meteor. I got hit when EV1 had a fire in one of their DCs.
Fortunately we had a local backup from which we restored and kept on running,
but a lot of companies were not in that position and had a hard time
surviving. EV1 was hurt badly by this; it's not just meteors. What happened
was that a transformer on the floor exploded, took a dividing wall with it,
and caused a (surprisingly!) relatively minor fire.

What took down the DC for several weeks was the fire department's
investigation. They took their time to figure out the root cause of the
fire, which is their right, but the collateral damage of that was
substantial.

So don't just plan for meteors.

~~~
jewel
The nice thing, from the perspective of Backblaze and their customers, is
that downtime is far more tolerable than it would be for most businesses.
Most disasters that are going to impact a data center aren't going to destroy
the physical hard drives, assuming the data center has the usual safeguards
in place.

~~~
brianwski
Brian from Backblaze here-> this is definitely true. If you ask to have a 5
TByte restore prepared, it will take us a full 22 hours to get that all
assembled for you. If you want us to FedEx the prepared restore on a USB hard
drive, it will take ANOTHER 24 hours, and if you are in Europe it's more like
48 hours.

And what's luxurious about "backup" as a business is that this doesn't bother
many customers. As long as we keep communicating with them on the progress,
and we assure them they are going to get every solitary
bit/byte/jpeg/mp3/movie back - they often tell us to take our time and do it
right. For "backup", accurate and durable are about a thousand times more
important than "instant gratification".

------
chiph
_At our current growth rate, Backblaze deploys a little over one Vault each
month._

That's roughly a storage pod every workday. Some back-of-the-envelope math
says that's almost a tractor-trailer worth of hardware a month. Wow.

~~~
atYevP
Yev from Backblaze here -> We're VERY proud of our datacenter techs, and you'd
hear more about them if they weren't so shy. It IS a monumental achievement
though, considering that a few years ago we only had two guys running our
entire farm.

~~~
tootie
The client I'm working for failed to get a web server procured with 6 months
lead time. And that's a company 100X bigger than Backblaze. I wish they
understood how much money they are wasting by being stingy.

~~~
atYevP
All you can do is keep sending them these posts :)

------
arca_vorago
I find the work Backblaze is doing wonderful, mostly because of how open they
are with their data. Watching their numbers on certain HDD failures has
really helped steer some of my recent purchasing decisions. I'm also really
interested to see more details on their custom RAID replacement.

Essentially RAID is dead to me, and ZFS, BTRFS, etc. seem to be the only way
forward, so I hope they GPL the code.

For anyone from Backblaze reading this, I'm curious: have you found that
backplanes are becoming a primary bottleneck? Because that seems to be the
case (SATA 6 Gb/s hurts after using things like Fusion-io or even
Thunderbolt). Any insights into the future of backplanes?

------
willtheperson
Backblaze seems so forward thinking with the hardware, but if you've ever
tried to restore a file using their web interface, it's an exercise in
frustration.

If you want to pull a single file, you'll be navigating through a Windows
95-esque tree. They store snapshots, but if you want to change snapshots, you
wait a minute for each one while it loads. Even going back to a snapshot you
were just looking at, you wait the whole load time.

Now, if you actually need to restore something, you can download a zip file.
They will only let you make the zip file so big, so you have to break up your
restore into multiple zip files. You have to do that manually; there is no
way to have BB auto-generate the parts for you. And these are zip files, not
one of the many archive formats that allow for parts.

Besides that, you will need double the amount of storage to recover this data
since you'll need to store the zip and the extracted backup.

The way around this is to pay BB to put the data on a USB drive (flash up to
128GB or external up to 4TB) at $99 and $189 respectively. A 128GB USB drive
on Amazon, first result, is $120. They actually won't give you a 4TB drive
unless your restore needs it, but the price is still $189. Labor, I guess?
According to their own FAQ, you will wait 2-3 days for them to ship these
drives, so hopefully you don't need that restore anytime soon.

I really liked Backblaze up to the point I needed to use it for its real
purpose. It seems like nobody at BB cares about the restore process, or that
it doesn't sell new subscriptions.

--

Also, maybe someone from BB can explain why the secure.backblaze restore
website loads tracking pixels from googleads.g.doubleclick, a.triggit,
s.adroll, facebook, ads.yahoo, x.bidswitch, ib.adnxs and idsync.rlcdn. Are
you selling my need for a new harddrive or something?

~~~
brianwski
> It seems like nobody at BB cares about the restore process

Brian from Backblaze here -> I care! It just keeps getting bumped by something
higher priority. I have a spec for how to speed up the restore tree browsing,
it's just waiting for us to have a spare moment. For a while the Vaults took
precedence.

Part of running Backblaze without VC funding is that we can only hire
programmers when we can afford it out of profits, and we're up to about 6
programmers (the result of a recent burst of hiring) who handle all of
Windows, Macintosh, iOS, and Android, and in the datacenter built the pods,
the Vaults, and the web front end. But we'll get there, I swear.

> or that it doesn't sell new subscriptions.

This is unfortunately the heart of the problem. The most important thing to
get smooth as glass is the BACKUP part; that sells new subscriptions. If we
have your data safe, we can always hobble through a restore - even if it is a
little slow and clunky, we can get all your files back after your laptop is
stolen. If it held up sales, we'd jump over and do the one week of work to
speed it up.

------
coreymgilmore
Always impressed by the write-ups the Backblaze team does. Very informative
and very clear on what you guys are achieving and how you do it.

Crazy amount of hardware involved, and Backblaze is the "small kid on the
block" in relation to FB, Google, and Amazon.

~~~
atYevP
Yev from Backblaze here -> Yea, we hope they start doing more stuff like this
too! More information = everyone wins!

------
pgrote
Backblaze is fantastic for sharing all their hardware development efforts.
We've built our own pod for onsite storage and it's amazingly awesome.

~~~
atYevP
Yev from Backblaze here -> Nice! Glad it's working for you :)

------
abalone
What does 99.99999% annual durability mean in practical terms? That everyone
should expect to lose a few bytes per year? Or that only one in a million
customers will be affected by data loss?

(I've never been good at statistics.)

~~~
scott00
That number is probably the estimated probability that, over the course of a
year, a single vault loses every file contained on it (which is what would
happen if 4 drives in a single vault failed simultaneously and
irrecoverably).

As a consumer, the type of failure you would experience, and the probability
of experiencing that failure, given that Backblaze has suffered a vault
failure, depends on how they distribute your data amongst their vaults. They
don't explicitly say how they do this, so it's impossible to know for sure,
but we can consider the two extreme scenarios.

Scenario 1: Each customer is assigned a single vault, and all your files are
on it. In this case, if Backblaze lost a vault, you would either luck out and
have your files on another vault and be completely unaffected, or get really
screwed and have all your files on the bad vault, and lose them all. They've
got 150 PB of storage, and each vault stores 3.6 PB of data, so we can
estimate that currently you may have something like a 1 in 40 chance of having
your data on any given vault. So under this scenario, you would have a 1 in
400 million chance of losing all your files.

Scenario 2: Each customer's files are uniformly distributed across all vaults.
In this case, if Backblaze lost a vault, all customers would lose a fraction
of their files. Again, using our estimate that they might have 40 vaults, you
would have a 1 in 10 million chance of losing 2.5% of your files.

So up to now, we're basically just doing the math without questioning the
assumptions of the model. In reality, I think your practical risk is mostly
concentrated in things outside of the model: ie, an event that affects all of
their vaults simultaneously, like a fire, earthquake, meteor strike, etc. If I
had to make a bet about what that number is, I'd put it in the 1/10,000 to
1/100,000 range. In other words, orders of magnitude higher than losing data
because some hard drives failed, or a backblaze employee spilled his coffee,
or something like that.
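
Putting the two scenarios above into numbers (both the per-vault reading of
the durability figure and the ~40-vault estimate are assumptions carried over
from this comment):

```python
p_vault_loss = 1 - 0.9999999    # 99.99999% durability -> ~1e-7 per vault-year
vaults = 150 / 3.6              # ~41.7 vaults; the comment rounds to 40

# Scenario 1: all of a customer's files live on one vault
p_lose_all = p_vault_loss * (1 / 40)    # ~2.5e-9, i.e. 1 in 400 million

# Scenario 2: files spread uniformly, so a vault loss costs 1/40 of your files
p_lose_slice = p_vault_loss             # ~1e-7, i.e. 1 in 10 million
```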

~~~
abalone
Thanks. IMHO the greatest risk of data loss is bugs in the homegrown software
and/or operator error during maintenance operations, not a natural disaster.
We infallible software engineers always underestimate that stuff, but it's
usually the cause.

Also, I'm not worried. If that probability only concerns data loss on
Backblaze's side, even if it's 1/10,000, then that's still not the probability
of actual customer data loss. Because for that to happen there'd have to be a
simultaneous loss of data on the customer side as well. That probably extends
the durability considerably.

~~~
mdaniel
_We infallible software engineers always underestimate that stuff, but it's
usually the cause._

My former boss used to say that 90% of all problems are cabling. His
percentage may be off, but the sentiment certainly isn't.

------
saosebastiao
I know this is a hijack that doesn't have anything to do with the article, but
every time I see a Backblaze article I can't help but say it: Give us a Linux
Client already!!!!

~~~
brianwski
Stay tuned.

~~~
kyledrake
Very much interested in this. I asked a while ago if I could build one
myself; I was initially told it was OK, but was later told it's against your
ToS.

Days, Weeks, Months, or Years?

~~~
atYevP
Pretty sure we can't say much other than stay tuned at this point :)

------
super_sloth
Still no linux support?

Any plans for that or even just an API so I can write one?

------
tetrep
Their website requires TLS but doesn't support TLS 1.2. What technology are
they using to serve their website that cannot support a 7 year old standard?

~~~
theandrewbailey
I'm happy their webserver has better cipher suites than it did last week.

~~~
atYevP
Yev from Backblaze here -> we're working on it!

~~~
brianwski
Brian from Backblaze here -> seriously, I just beat that team up today again,
we'll get there.

------
ajays
FTA: "If one of the original data shards is unavailable, it can be re-computed
from the other 16 original shards, plus one of the parity shards"

So you would have to read 17x the data to recreate it. Given disk latency,
network bandwidth, etc., I'm guessing it'll take quite a while to recreate a
6TB HDD if it fails.

~~~
chiph
The drive operations (including waiting for the head to seek, etc) could take
place in parallel. There's also the chance the data needed was in cache. So it
might not be horribly bad.

~~~
ajays
We're talking about 102TB of data (6TB x 17), so a cache won't dent that
figure much (especially given that each of those disks is one of 45 drives
in the pod, which means the cache is shared across all of them). Then each of
the drives will also be serving files (or storing them), which means disk
heads will be seeking all over the place while rebuilding the failed
drive...
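
To put a rough number on it, assuming (purely hypothetically) a single
10 Gbit/s link into the pod doing the rebuild:

```python
TB = 10 ** 12
to_read = 17 * 6 * TB      # 102TB pulled from the 17 surviving shards
link = 10e9 / 8            # assumed 10 Gbit/s link, in bytes per second
hours = to_read / link / 3600
print(f"~{hours:.0f} hours just to move the bytes")   # ~23 hours
```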

~~~
Dylan16807
Actually, it shouldn't take particularly long. The key is to distribute.

Sending 102TB of data to the pod that's rebuilding would take forever, this is
true.

Instead you have each peer pod be responsible for 1/17th of the parity
calculations.

1. 17 pods each read 17 megabytes and send them across the network.

2. Pod A gets megabyte 1 from all 17, Pod B gets megabyte 2 from all 17,
etc.

3. Each pod calculates its megabyte of the replacement drive and sends it
off.

4. Repeat until 6TB have been processed.

So this way each pod reads 6TB from disk, sends 6TB across the network,
receives 6TB across the network, and calculates one seventeenth of the data
for the replacement drive.

It scales perfectly. It's no slower than doing a direct copy over the network.

Just make sure your switch can handle the traffic (which it already has to
handle for filling the vault in the first place).
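
A rough back-of-the-envelope on the distributed version, assuming (purely as
an illustration) a 10 Gbit/s link per pod:

```python
TB = 10 ** 12
drive = 6 * TB             # the failed 6TB drive being rebuilt
per_pod_sent = drive       # each surviving pod reads and sends its full shard

link = 10e9 / 8            # assumed 10 Gbit/s per-pod link, in bytes/second
hours = per_pod_sent / link / 3600
print(f"~{hours:.1f} hours, network-bound")   # ~1.3 hours
```

The rebuild is bounded by one shard's worth of traffic per pod rather than
102TB funneling into a single machine.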

