
Pocket Watch: Verifying Exabytes of Data - nuriaion
https://blogs.dropbox.com/tech/2016/07/pocket-watch/
======
james_cowling
Hi folks, James from Dropbox here. Happy to answer any questions that come up
so feel free to send them my way.

~~~
toomuchtodo
Was shocked to see that your durability (27 9s) was so much higher than what
S3 claims (11 9s), while also charging less for storage and bandwidth than S3
would. Amazing!

I would be curious to see how much of your verification architecture is shared
with someone like Backblaze, but I assume the only way to learn that would be
to work for both companies to compare :)

Would Dropbox ever share hard drive reliability stats similar to what
Backblaze does?

Sorry for all the questions in one comment! Not affiliated with either Dropbox
or Backblaze, just an infrastructure wonk.

~~~
jerf
27 9s is literally higher than my confidence that human civilization will be
here in the next second. 10^27 seconds is about 32 quintillion years.
Extinction events occur with a much higher frequency than that.

No criticism of Dropbox here; they know that number is just math games, too.
I'm just putting some numbers on how true that is. Because I find tossing
about big numbers like this as if they mean something a bizarre, nerdy sort of
fun.
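For anyone who wants to check that arithmetic, a one-liner does it (the seconds-per-year constant is the usual Julian-year approximation):

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600  # ~3.16e7 seconds

# An average interval of 10^27 seconds, expressed in years:
years = 1e27 / SECONDS_PER_YEAR
print(f"{years:.2e} years")  # ≈ 3.17e19, i.e. ~32 quintillion years
```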

~~~
gamegoblin
Exactly. They say they are using some variant of Reed-Solomon erasure coding.
If you did something like K=100 and N=150 and stored all of the shards on
different disks, the probability you lose data is the probability that more
than 50 hard drives fail before you can repair the lost shards (leaving fewer
than the K=100 shards needed to reconstruct).

If I am reading the article correctly, they claim that they should usually be
able to repair in less than an hour in the case of disk failure.

Thus the probability of losing more than 50 (or whatever their N-K value is)
disks within an hour is how you get 27 nines of durability.
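As a rough sketch of that binomial tail (the K/N values match the example above, but the 4% annual failure rate and the one-hour repair window are assumptions for illustration, not Dropbox's actual numbers):

```python
from math import comb

def loss_probability(n: int, k: int, p: float) -> float:
    """Probability that more than n-k of n shards fail within one repair
    window, assuming independent failures with probability p each.
    Losing more than n-k shards leaves fewer than k, so the stripe
    can no longer be reconstructed."""
    tolerated = n - k
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i)
               for i in range(tolerated + 1, n + 1))

# Hypothetical per-disk failure probability within a one-hour window,
# derived from an assumed 4% annual failure rate:
p_hour = 0.04 / 8760
print(loss_probability(150, 100, p_hour))  # astronomically small
```

With independent failures the result comes out far smaller than even 10^-27, which is really the point of the thread: the model, not the arithmetic, is what limits real-world durability.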

Of course, the probability that one of your software engineers introduces a
durability bug is WAY more likely than those disks experiencing a coordinated
failure.

Or say, the probability that a terrorist organization targets your
datacenters. Even if those odds are one in a billion, that's still not even
close to 27 nines.

~~~
james_cowling
For sure. I hope we're all agreeing here :)

We're very strong believers that an effective replication strategy is just
table stakes and that from there the real risks to durability are the "black
swan" events that are much harder to model.

I gave a talk at Data@Scale recently where the main premise is about
"Durability Theater" and how to combat it in a production storage system. In
case you're interested: [https://code.facebook.com/posts/253562281667886/data-
scale-j...](https://code.facebook.com/posts/253562281667886/data-scale-
june-2016-recap/)

------
pbarnes_1
This system is pretty amazing. Thanks for sharing all the details as you went
along.

You guys have made me interested in rust, which I still find a little too
verbose vs golang. :D

~~~
jamwt
Thanks!

We're expanding our use of Rust into even more bold endeavours, details soon.

Based on our experience with it over the last 1.5 years: stick with it; it
will return your effort many times over in the medium-to-long run. Then again,
if
there is no medium-long run for your project (as an entity needing
maintenance), just use Python.

