
CamelCamelCamel: $44.6k disk failure, down until Feb 3 - lwhsiao
https://camelcamelcamel.com/
======
raiyu
As others have pointed out, using consumer grade SSDs for a database is a bad
idea. Databases do a lot of reads and writes, so that's going to lead to
problems down the road.

The second issue is that since this was set up as a RAID, I'm assuming that all
of the drives were purchased at the same time when it was originally built,
which means that, depending on the supplier, you are probably going to receive
drives from the same production batch. Because RAID spreads writes evenly
across its members, drives from the same batch wear at the same rate and share
any manufacturing fault, so you can have failures on multiple drives that occur
at approximately the same time.
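
A rough way to see why simultaneous failures point at a shared cause rather
than bad luck: if drives failed independently, three dying on the same day
would be vanishingly rare. The sketch below assumes a hypothetical 16-drive
array and a hypothetical 2% annual failure rate per drive; neither number comes
from the CamelCamelCamel post.

```python
from math import comb


def daily_failure_prob(annual_failure_rate: float) -> float:
    """Convert an annual failure rate into a per-day failure probability,
    assuming a constant hazard over the year."""
    return 1.0 - (1.0 - annual_failure_rate) ** (1.0 / 365.0)


def prob_at_least_k_fail(n_drives: int, k: int, p: float) -> float:
    """Binomial probability that at least k of n independent drives fail,
    each with probability p, within the same window."""
    return sum(
        comb(n_drives, i) * p**i * (1.0 - p) ** (n_drives - i)
        for i in range(k, n_drives + 1)
    )


# Hypothetical numbers, chosen only for illustration.
N_DRIVES = 16
ANNUAL_FAILURE_RATE = 0.02

p_day = daily_failure_prob(ANNUAL_FAILURE_RATE)
p_triple = prob_at_least_k_fail(N_DRIVES, 3, p_day)
print(f"P(>=3 of {N_DRIVES} independent drives fail on one day) = {p_triple:.3e}")
```

Under these assumptions the independent-failure probability is tiny, which is
why three failures at once suggests a common batch or wear pattern.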

The simplest solution is of course to buy enterprise grade drives, but the
obvious issue there is that the price is much higher, especially if you are
going to be purchasing the largest drives: you always pay a premium for max
capacity drives, and the price jump on large drives is not insignificant.

When we first built out our SSD cloud for DigitalOcean at the end of 2012, we
went with consumer drives because we were bootstrapped and SSDs were also
significantly more expensive, so we couldn't justify the price jump without
knowing if we had product-market fit.

But rather quickly we ran into a whole host of issues. Another common issue is
that performance can degrade when consumer drives get to near full disk
capacity so you can't really use all of the available space if you plan to do
a significant number of reads and writes, which is pretty much what a database
server is designed for.

Switching to enterprise grade drives was the first upgrade we did after
raising money. Unfortunately you have to pay the premium, but it really
reduces problems significantly.

~~~
fabioborellini
I do believe your experience, but I started wondering what makes an enterprise
drive more reliable than a consumer-targeted one. What's preventing the
manufacturers from selling the same drive with two different names and prices?
Reputation?

~~~
mirimir
One difference is that enterprise drives have lots more spare capacity. Like
>>100%.

~~~
u02sgb
This can be replicated in consumer drives by underprovisioning.

~~~
mirimir
OK, fair point. Although I don't know enough about SSDs to know whether
there's more to it.

But what else is different?

~~~
u02sgb
I think mainly warranty, although other posters are saying enterprise drives
have capacitors to help with writing during power loss. I'd have thought a
good battery-backed RAID controller would handle this, but "enterprise"="belt
and braces" so I can see why a lot of businesses pay the extra.

~~~
tinus_hn
So the enterprise drives perform as specified and the consumer drives don’t?

~~~
u02sgb
Sorry, late to reply! The warranty differences are things like with enterprise
they'll send you a new one as soon as the SMART data suggests it's going to
fail. Consumer is usually you post it back and they send you a new one.

------
akerro
>On the evening of Saturday, January 26th, our database server had three hard
drives fail. It was designed to handle two disk failures, but three failed
disks made the situation catastrophic.

Like in the joke.

Our data storage was so well designed that when the first disk died, we didn't
notice. When the second disk died, we also didn't notice. When the third disk
died, we noticed.

~~~
howiroll
It’s likely that the three drives died concurrently.

~~~
FPGAhacker
And they are probably setting themselves up for a similar fate in the future
by replacing them all (failed and working alike), and at once.

~~~
lucb1e
They don't fail that reliably, unfortunately. It's not a light bulb with a
certain number of hours on it. See Backblaze's drive statistics.

Edit: I just saw it's about SSDs and not HDDs, but while Backblaze might not
have stats on those, I'm fairly sure my comment still applies. Not 100%, but I
assume they account for failures due to predictable issues like write wear.

~~~
nemothekid
> _I'm fairly sure my comment still applies_

There are other comments in this thread that go into this, but your comment
doesn't apply. Another poster used the light bulb analogy well to describe SSD
failures.

It's very common to see SSDs that were purchased at the same time, and that
were likely manufactured in the same batch, fail within hours of each other.
I'm pretty sure I even read this on Backblaze.

------
chousuke
To me, using consumer-grade SSDs seems less of a problem than not having a
database replica (for quick recovery) and PITR-capable backups (for near-zero
data loss). These are not difficult to set up and are definitely worth it if
your data matters.

In most cases your disaster recovery scenarios should include the complete
destruction of a single server.

~~~
kalleboo
Yeah I was surprised by that too. A write-only replica might even be doable on
cheap spinny rust disks if it's just product and account info.

------
sdan
Using consumer grade SSDs is a huge no-no. For a company this size a simple
server rack with something like production-ready Seagate drives would do
better in the long run. In my view, the SSDs they bought make CCC look like a
school side-project.

~~~
itake
I can't believe a company of this size is asking for donations...

~~~
jnwatson
I never got the impression they were big at all.

~~~
ElCapitanMarkla
Yeah I always thought it was a one man band until now

------
howiroll
Why do they use Samsung consumer drives on serious things?

Consumer SSDs lack IOPS and should never be used for database storage.

Also, I never use consumer SSDs on things that are not a joke. Serious things
need enterprise SSDs.

~~~
whitepoplar
I think some people rely on the historical fact that consumer 7200rpm SATA
drives used to be nearly identical to their enterprise 7200rpm counterparts.
Consumer SSDs are very different from enterprise SSDs; the former typically
lack power loss capacitors, which should be a nonstarter for anyone using
them to run databases.

~~~
u02sgb
Surely a battery-backed RAID controller would be enough? That's how we did it
with spinning drives in the past.

------
latch
I guess they're busy now trying to fix things, but since they didn't mention
how to avoid this in the future, I feel like streaming replication [to much
cheaper spinning disk] or some other form of point in time recovery is the way
to go.

------
u02sgb
Consumer SSDs can absolutely be the right choice in some situations. They're
great for scrappy development environments or things that are refreshed
regularly, like ETL migration runs. I'd advise underprovisioning them by 10%
and keeping good (regular) backups though. Also be aware that two drives of
the same age failing together in RAID1 is common (same number of
writes=similar failure time).
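
The 10% figure is simple arithmetic: leave that fraction of the drive
unpartitioned so the controller has extra spare blocks to work with. A minimal
sketch, where the 1000 GB drive size is a made-up example and the function name
is hypothetical:

```python
def usable_capacity_gb(raw_capacity_gb: float, reserve_fraction: float = 0.10) -> float:
    """Capacity to partition after leaving reserve_fraction of the drive unallocated."""
    if not 0.0 <= reserve_fraction < 1.0:
        raise ValueError("reserve_fraction must be in [0, 1)")
    return raw_capacity_gb * (1.0 - reserve_fraction)


# Hypothetical 1000 GB consumer SSD, underprovisioned by the suggested 10%.
partition_gb = usable_capacity_gb(1000)
print(f"Partition {partition_gb:.0f} GB and leave the rest unallocated")
```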

A shame that CamelCamelCamel seem to be running on a bit of a shoestring
budget as it's a very useful tool. To be fair to the recovery costs though the
majority of them are from a professional data recovery company and that ain't
cheap!

------
tass
Does anyone know whether they got unlucky with 3 simultaneous failures, didn’t
notice earlier failures, or had some other issue that killed all the drives?

~~~
cosmin800
Well, from what I see in the pictures they are using consumer grade SSDs to
power an enterprise-grade database. This is not bad luck, this is just bad by
design.

~~~
howiroll
And they are making the same ’mistake’ again.

------
cjbprime
What is CamelCamelCamel?

~~~
derstander
As dddddaviddddd said, it's a price tracker for Amazon. To go into a little
more detail: you can enter a product name or Amazon URL and it'll show you
first and third party pricing history. You can also set a price threshold on a
particular product and it'll either tweet or email you when the product price
drops below that point. You can even do it without signing up or logging in.

I've set notifications on a couple items that I regularly order and when they
drop below a certain price I order enough in advance.

~~~
majewsky
> You can even do it without signing up or logging in.

Then what's their business model? Do they sell aggregated customer traffic
data to merchants?

~~~
klohto
I'm guessing that referrals are a big driver

------
lazylizard
Wouldn't using a RAID card mean that the wear on the SSDs will be almost the
same? I.e. they are likely to fail together?

------
bdibs
This is why you shouldn’t roll your own hardware (unless you have very, very
specific needs).

~~~
Plastikdusche
I wonder what their annual cloud bill would have looked like compared to the
bill they've now got

~~~
nemothekid
56TiB of gp2 SSD storage on Amazon would cost $6,500/mo (so $78,000/yr), so if
it's a 1:1 setup they could afford to fail a couple more times. Now there's
the question of whether they would need all that space in a cloud setup, but
that's another story.
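
As a back-of-the-envelope check, using only the figures quoted in this thread
($6,500/mo for the cloud storage, $44.6k for the failure from the title) and
ignoring what the self-hosted hardware itself costs, a minimal sketch:

```python
CLOUD_MONTHLY_USD = 6_500   # gp2 estimate quoted in this comment
FAILURE_COST_USD = 44_600   # disk failure cost from the thread title


def annual_cost(monthly_usd: float) -> float:
    """Yearly cost of a flat monthly bill."""
    return monthly_usd * 12


def failures_covered_per_year(monthly_usd: float, failure_cost_usd: float) -> float:
    """How many failures of this size one year of the cloud bill would pay for."""
    return annual_cost(monthly_usd) / failure_cost_usd


cloud_yearly = annual_cost(CLOUD_MONTHLY_USD)
ratio = failures_covered_per_year(CLOUD_MONTHLY_USD, FAILURE_COST_USD)
print(f"Cloud storage per year: ${cloud_yearly:,.0f}")
print(f"Failures of this size covered per year: {ratio:.2f}")
```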

What concerns me, from the image, is that they seem to be using cheaper,
consumer grade drives (Samsung 860 PROs I assume). This is asking for trouble:
as they found out, 3 drives failed simultaneously. I've only ever dealt with
cloud infra, but even I know that drives that are bought together tend to fail
together. It's likely their new batch of drives will also fail simultaneously.
Backblaze has done a ton of research on drive longevity.

Just seems penny-wise, pound-foolish to me.

~~~
realusername
> drives that are bought together, tend to fail together

It's exactly the same as the lights in your house: all of them were made in
the same batch with the same quality of material and are used the same way, so
they tend to fail within a few days of each other.

------
andretti1977
What surprises me most is the fact that $44k is a big financial loss for
them: I would expect CCC to earn at least a million/year, so that $44k would be
nothing more than a little annoyance

~~~
ac29
$1M/year seems high, but even if that's right, $44k isn't small. That's most
(or maybe all) of someone's annual salary. There only seem to be 7 people
associated with the company [0], so that's a big loss.

[0] http://www.cosmicshovel.com/

~~~
andretti1977
If they say $44k is a significant loss then I agree they probably make less
than $1M. That said, I expected them to be a lot more profitable

------
gonesilent
It's just 3 people and some servers hosting what looks to be one of the top 5k
sites in the world based on traffic.

------
fraXis
They didn't have a backup?!? Look how much they are paying for data recovery!

~~~
redisman
Did you read the very short post?

~~~
middus
If the only backups they have are so old that they are deemed obsolete, that's
essentially having no backups at all.

