

Hard Drive Data Sets - epistasis
https://www.backblaze.com/hard-drive-test-data.html

======
budmang
At Backblaze, we've done a lot of our own analyses on our hard drives to look
at which ones are most reliable, whether temperature affects reliability, etc.
However, we kept being asked for the raw data so people could run their own
analyses.

This data set releases 500 million data points on 41,000 drives. I imagine you
guys here at YCombinator/Hacker News are the ones who'd find this data amongst
the most useful. Enjoy and let us know what you find!

~~~
crackulator
Your website uses a terrible HTTPS config.

[https://www.ssllabs.com/ssltest/analyze.html?d=www.backblaze...](https://www.ssllabs.com/ssltest/analyze.html?d=www.backblaze.com)

No PFS, weak RC4, TLS 1.0 only, and SHA1.

It's not much effort to fix these, really. It makes me question the security
of your product.
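
For reference, three of the four items above (no PFS, RC4, TLS 1.0 only) can be addressed in server configuration. The sketch below uses Python's standard `ssl` module purely as an illustration; it says nothing about the server software Backblaze actually runs, and the function name `make_server_context` is hypothetical. The SHA1 item concerns the certificate's signature and is fixed by reissuing the certificate, not by configuration.

```python
import ssl


def make_server_context() -> ssl.SSLContext:
    """Build a server-side TLS context that avoids the weaknesses listed above."""
    ctx = ssl.create_default_context(ssl.Purpose.CLIENT_AUTH)
    # Refuse SSLv3, TLS 1.0 and TLS 1.1 handshakes.
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    # For TLS 1.2, allow only ECDHE key exchange (forward secrecy) with AES-GCM,
    # which rules out RC4. TLS 1.3 suites are managed separately by OpenSSL.
    ctx.set_ciphers("ECDHE+AESGCM")
    # A real server would also call ctx.load_cert_chain(...) with its own files.
    return ctx
```

The trade-off raised in the reply below is real: a policy like this locks out clients that cannot negotiate TLS 1.2.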

~~~
atYevP
We got a B! That's above average! Totally not "terrible". Truth is the website
and the service itself are fairly different, but I can tell you that we have a
lot of folks using, let's say..."older", browsers that access our site, and so
we try to make sure that they can still access their accounts - though we're
constantly monitoring for ways to make it better. A lot of folks that use us
have older operating systems and that hampers us a bit.

~~~
GeorgeHahn
Even though it stops folks with updated browsers from accessing your site?

[http://i.imgur.com/3GCKrzr.png](http://i.imgur.com/3GCKrzr.png)

~~~
brianwski
Brian from Backblaze here. Which browser was that? We often test with
Mac/Windows and Safari/Chrome/IE/Netscape/Opera, etc. It all seems to work
here. Did you tweak any browser settings or are you running stock?

~~~
miduil
Unable to Connect Securely

Firefox cannot guarantee the safety of your data on www.backblaze.com because
it uses SSLv3, a broken security protocol. Advanced info:
ssl_error_no_cypher_overlap

Firefox 35, rc4 disabled

As far as I understand, you should not use RC4 (and especially not RC4 alone):
[https://en.wikipedia.org/wiki/RC4](https://en.wikipedia.org/wiki/RC4). Even
Cloudflare is trying to get rid of RC4:
[https://blog.cloudflare.com/killing-rc4/](https://blog.cloudflare.com/killing-rc4/)

> Fast-forward to 2013 and attacks on RC4 have been demonstrated; that makes
> the preference for RC4 problematic.

~~~
yuhong
Looks like they killed 3DES, leaving only RC4.

------
L_Rahman
If you haven't already come across Backblaze's series of blog posts based on
these data sets, they're worth reading. Invaluable in helping me make an HDD
purchasing decision and full of well presented data.

[https://www.backblaze.com/blog/best-hard-drive/](https://www.backblaze.com/blog/best-hard-drive/)

~~~
fnordfnordfnord
Way better than Tom's Hardware, et al.

------
DanBC
Thanks for this!

I'd be interested in whether shucking drives from external enclosures has any
noticeable effect on drive life. But the data doesn't seem to capture whether
the drives were shucked or not?

Is that something Backblaze has investigated? Or is the need for drives such
that it doesn't matter if shucking does cause shorter life?

~~~
budmang
The tracking on that wasn't easy to align with these data, but from what we've
seen the shucked drives seemed to perform similarly. At this point, the
percentage of shucked drives in the data set is fairly small.

------
peterebailey
I merged the two years of data into a single R data file for convenience:

[http://pyrovski.github.io/backblaze_data/](http://pyrovski.github.io/backblaze_data/)

------
skore
Nitpicking: It seems you might have an issue with font loading - for me
(Firefox on Linux), it reverts to font-weight 100, making the text (which is
missing <p> tags, by the way) almost completely unreadable.

Fig A.: [http://i.imgur.com/9rLQwQQ.png](http://i.imgur.com/9rLQwQQ.png)

~~~
brianwski
Brian from Backblaze here.

I keep complaining to the visual designer about this, I can't figure out why
this is so hard to fix. What's really strange is it often looks GREAT in some
web browsers nobody would ever use (IE) but in Chrome on Windows the lower
case "g" characters are almost unreadable and disappear.

If only somebody knew how to fix this?

How did you detect it was missing < p > tags? Is there a tool the designers
could error check against to see this error?

~~~
skore
> How did you detect it was missing < p > tags?

Just opened it up in the inspector in Firefox, no black magic here ;-)

I would definitely look into your font names - it tells me that "Lato-
Hairline" is used as: "Lato" and "Lato Regular" is used as: "@font-face:Lato".
So perhaps the issue is that Lato-Hairline is the font-weight = 100 and
Firefox picks it over the other one, only finds the single font-weight and
sticks with it?

Just a guess though, webfonts can be weird. For instance: For Chrome, it can
depend on what version you use. Just today I ran into a font rendering bug
similar to this one where Chromium versions 37&38 had their font-weights
switched so that 300 ended up as 500 and 500 was also picked for "normal". So
the bug report "all fonts are bold and it looks terrible" resulted in "CANTFIX
old Chrome be weird", basically.

~~~
brianwski
Cool, thanks! I have forwarded the info, it's actually been driving me
slightly bonkers. The designer uses a Macintosh, but my primary development
box with a 30" monitor happens to be BOTH Macintosh OR Windows (KVM switch)
and when I see our blog in Windows it looks terrible.

------
miduil
@brianwski [Offtopic] Is Backblaze going to implement delta copy or something
similar in the near future? The last time I checked it definitely didn't. This
becomes a real issue if I'm working on bigger binary files, since Backblaze is
syncing the whole file again instead of only its difference... PS: Found
this interesting comparison of backup services:
[https://en.wikipedia.org/wiki/Comparison_of_online_backup_se...](https://en.wikipedia.org/wiki/Comparison_of_online_backup_services)

~~~
brianwski
We do transmit "changes" to large files in 10 MByte chunks. In other words, if
1 byte changes in a 50 MByte file it SHOULD only transmit one single 10 MByte
chunk.

The absolute worst case for Backblaze is if you insert a single byte at the
start of a large file. This "shifts" the entire file along by the one byte,
effectively changing every single 10 MByte chunk.

The BEST case is if you append a single byte to a large file, because the
final chunk then is probably less than 10 MBytes.

I actually thought we would be working on that area quite a bit over the years
but it kind of worked well enough. :-) Most people don't edit large files,
with the exception of Outlook.pst files, we see those appear as bandwidth
burners.
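
A minimal sketch of the fixed-size chunking described above, to make the three cases concrete. This is not Backblaze's code: comparing chunks by SHA-256 digest and the function names are assumptions made for illustration, and the demo uses 10-byte chunks in place of 10 MByte ones.

```python
import hashlib

CHUNK_SIZE = 10 * 1024 * 1024  # 10 MByte chunks, as described above


def chunk_digests(data: bytes, chunk_size: int = CHUNK_SIZE) -> list[str]:
    """Hash each fixed-size chunk of `data` (the last chunk may be shorter)."""
    return [
        hashlib.sha256(data[i:i + chunk_size]).hexdigest()
        for i in range(0, len(data), chunk_size)
    ]


def changed_chunks(old: bytes, new: bytes, chunk_size: int = CHUNK_SIZE) -> list[int]:
    """Return the indexes of chunks in `new` that differ from `old` or are new."""
    old_d = chunk_digests(old, chunk_size)
    new_d = chunk_digests(new, chunk_size)
    return [
        i for i, digest in enumerate(new_d)
        if i >= len(old_d) or old_d[i] != digest
    ]


# Demo with tiny chunks: a 45-byte "file" split into 10-byte chunks.
old = bytes(range(45))

modified = bytearray(old)
modified[25] ^= 0xFF  # change one byte in the middle of the file

print(changed_chunks(old, bytes(modified), chunk_size=10))  # [2]
print(changed_chunks(old, b"\xff" + old, chunk_size=10))    # [0, 1, 2, 3, 4]
print(changed_chunks(old, old + b"\xff", chunk_size=10))    # [4]
```

The in-place edit touches one chunk, the one-byte insert at the start touches every chunk, and the append touches only the short final chunk.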

~~~
miduil
Thanks for your reply, brianwski. That seems fair enough.

------
alsocasey
This is very much appreciated:

1\. As mentioned by others, there really is very little data on HD failure
rates.

2\. When you first published your blog on failure rates across HD
brands/models and SMART attributes many, myself included, suggested it might
be more illuminating as a predictive modelling exercise. This data allows
others to do that now, which is great!

------
justcommenting
as someone who made previous comments on backblaze data analyses posted on HN,
i wanted to say thanks. this is fantastic, and i'm looking forward to digging
into these data! and even though i share some of the same sentiments from
other commenters, i'm sorry you've gotten so many bike shed remarks from other
commenters.

~~~
atYevP
Enjoy ;-)

------
mturmon
Releasing this data is a real service, thanks.

I'm unable to explain the plethora of comments nearby about peripheral issues.
Weird.

~~~
atYevP
No worries! Glad you like our release! We don't mind the peripheral stuff,
we're over like ->
[http://i.imgur.com/MYHLwt7.gif](http://i.imgur.com/MYHLwt7.gif)

------
ilzmastr
thanks for sharing this! A few quick questions:

\- Do you guys do any precise prediction on if a particular drive will fail
soon and replace it?

\- I notice a lot of sparsity in some rows, that is different than a 0 in that
field I assume? Does that mean anything else interesting?

\- Also under the "inconsistent fields" section you say "drive manufacturers
don't generally disclose what their specific numbers mean," can you give a
hint as to one of the drive models that has a minimally sparse smart readout
and has information available from the manufacturer on what those smart
numbers signify?

I figure if anyone has collected the references on what the metadata means,
and for which models it is available, it's you guys :)

~~~
brianwski
I'll try to get the datacenter techs to answer tomorrow, but here is my best
off the cuff attempt:

> predict if a drive will fail

We have some heuristics (high numbers of time outs and high remapped sectors),
but in the end most failures are sudden and catastrophic. It is more like
statistical tendencies, the most obvious one being drive age.

> sparsely in rows

Others have noticed, I have to ask the OTHER Brian (Brian Beach did the lion's
share of this drive stat collection and presentation).

> it's you guys

Aww shucks. :-) But remember, we do this as a guide for ourselves, but we
spend most days working on backup features and scaling, we don't have a lot of
extra time. That's why we're sending this data out there, some smart grad student
or PhD in Xerox PARC can hopefully figure out some good stuff we missed!
Besides, I don't think math and statistics are our strength, we just happen to
be sitting on one of the world's larger stock piles of spinning drives with
access to the computers with scripts. :-)

------
mischanix
Minor nitpick: using LZMA to compress large text files like this before
distribution is normally better; here 7z is LZMA, zip is DEFLATE:

    
    
        739M    2013_data
        37M     2013_data.7z
        78M     2013_data.zip
    

Using [1] for reference, if the download speed is less than ~20MB/s, LZMA is
faster than DEFLATE. Though, the data there is a bit less compressible than
these csv files, so the break-even point for transfer rate would be higher
here; even so, in my case, the download speed was much slower than 20MB/s.

1\. [http://richg42.blogspot.com/2015/01/parallelized-downloaddec...](http://richg42.blogspot.com/2015/01/parallelized-downloaddecomp-with.html)
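
A rough way to try the DEFLATE versus LZMA comparison locally is sketched below, using Python's standard `zlib` and `lzma` modules. These use the same algorithms as zip and 7z but different container formats, so the byte counts will not match the figures above exactly. The CSV columns and the helper names are made up for illustration and are not taken from the Backblaze files.

```python
import lzma
import random
import zlib


def make_csv(rows: int = 20000, seed: int = 0) -> bytes:
    """Build synthetic CSV text; the columns are hypothetical, for illustration."""
    rng = random.Random(seed)
    lines = ["date,serial_number,model,failure,smart_9_raw"]
    for i in range(rows):
        lines.append(
            f"2013-04-{1 + i % 30:02d},SN{i % 500:06d},MODEL-{i % 7},0,"
            f"{rng.randint(0, 30000)}"
        )
    return "\n".join(lines).encode("ascii")


def compare(data: bytes) -> dict[str, int]:
    """Return the raw, DEFLATE (zlib) and LZMA (xz container) sizes in bytes."""
    deflated = zlib.compress(data, 9)
    xz = lzma.compress(data, preset=6)
    # Round-trip both so the sizes are known to describe valid compressed data.
    assert zlib.decompress(deflated) == data
    assert lzma.decompress(xz) == data
    return {"raw": len(data), "deflate": len(deflated), "lzma": len(xz)}


if __name__ == "__main__":
    print(compare(make_csv()))
```

To compare on the real data, read one of the downloaded CSV files as bytes and pass it to `compare` in place of `make_csv()`.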

~~~
brianwski
Disclaimer: I work at Backblaze.

We tend to favor ZIP as a company because it unpacks on all platforms (like
Macintosh and Windows) with no additional tools: nothing is needed beyond what
the manufacturer provides.

In this case, we could provide the raw data in several formats in case
bandwidth is a problem, plus I think we should provide the SHA2 or md5 of the
resulting package just in case you are wondering if you got the correct
download or whether somebody has messed with the contents.
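
For anyone who wants to check a download once a digest is published, here is a small sketch using Python's standard `hashlib`. The function name and the file name in the usage note are placeholders, and no official digest is given in this thread.

```python
import hashlib


def file_sha256(path: str, block_size: int = 1024 * 1024) -> str:
    """Return the hex SHA-256 digest of a file, read in blocks to bound memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps calling f.read until it returns b"".
        for block in iter(lambda: f.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()
```

Usage would look like `file_sha256("2013_data.zip") == published_digest`, where `published_digest` is whatever value ends up posted next to the download.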

~~~
ars
Windows has the 7zip program which handles lzma just fine - it's overall a
better program than zip anyway.

~~~
pests
He said provided by the manufacturer by default, not something the user has to
install themselves.

------
aceperry
Nice. Glad to see more info out there for people to see and use. Backblaze has
done a great service for everyone including the hard disk industry.

------
avani
I'm really looking forward to playing with these; thanks for releasing them,
especially since there is so little failure data out there.

------
moe
This is a great resource, Thank you!

