
1-2 year SSD wear on build boxes has been minimal - jontro
http://lists.dragonflybsd.org/pipermail/users/2015-February/207469.html
======
justizin
Look. :)

Yeah, the build machine churns a lot. But that work should be primarily done
by the FS cache, by buffers. Yes, it's going to write out those small files,
but if DragonflyBSD has any kind of respectable kernel, it should be a
solid curve, not lots of bursts.

I would love if my old colleague Przemek from Wikia would talk about the SSD
wear on our MySQL servers which had about 100k-200k databases per shard.

We wore the _fuck_ out of some SSDs.

You should replace your HDDs with SSDs, though, for a number of reasons, and
take the long view, as kryptiskt noted the OP is doing. Really compare the
cost of SAS 15k drives and Intel 320s or 530s.

But in his place, I think you can take the words of the inimitable Mr.
Bergman:

      https://www.youtube.com/watch?v=H7PJ1oeEyGg

Stop wasting your life. But don't expect a machine that does lots of random
IO, like a database, to have 1-2% SSD wear after two years. It might not last
two years. If it does, use it more. Aren't you making money with these drives?
;)

~~~
Jabbles
There are several sites doing SSD stress tests. This one claims to have
written 2 Petabytes to a drive without failing:

http://techreport.com/review/27436/the-ssd-endurance-experiment-two-freaking-petabytes

~~~
rasz_pl
2 petabytes while validating writes RIGHT AFTER they happen, and never powering
the drive down. It's great when all you need is a /dev/null, not so much if you
need to power down from time to time and retrieve useful data later.

~~~
kalleboo
They did do a 5-day unpowered retention test:

http://techreport.com/review/25681/the-ssd-endurance-experiment-testing-data-retention-at-300tb

------
kryptiskt
"At some point in the next few years we are going to start getting HDD
failures on our blade server. It has ~30 hard drives plugged into it after all
(and ~12 SSDs as well). When that begins to happen I will probably do a
wholesale replacement of all HDDs with SSDs. Once that is done I really expect
failure rates to drop to virtually nil for the next ~20-30 years. And the
blade server is so ridiculously fast that we probably won't have any good
reason to replace it for at least a decade, or ever (though perhaps we will
add a second awesome blade server at some point, who knows?)."

That's taking the long view. :-)

~~~
Jolijn
Is hardware (motherboard, CPU, memory) really so good nowadays that one can
expect it to last 30 years? I don't think it's designed with that kind of
lifetime in mind.

~~~
dperny
I mean, this is server hardware. One of the major differences between server
hardware and desktop hardware is build quality. I've got a ten-year-old 1/2U
rack server sitting in a closet that I bought for pennies at a surplus auction
that still runs great.

~~~
mahyarm
But would it really last 30 years with such things as lead-free solder?

~~~
moe
We will probably have to wait 30 years to really know.

FWIW, lots of hardware from ~30 years ago still works. I have a 27 year old
Amiga500 that still boots fine (many of the floppy disks have become
unreadable, though).

You can buy fully working vintage computers much older than that on eBay.

~~~
vacri
30 years of constant/daily use is different from a few years of use followed by
pulling it out of mothballs as a curio every now and then.

------
ChuckMcM
Love this quote: _"This is the first time I've actually contemplated NOT
replacing production hardware any time soon."_

There are two things that benefit from the turnover of machines: one is that
Intel stays profitable; the other is that hardware standards have a chance to
evolve. The move from ISA->VESA->PCI->AGP->PCIe on video cards would not have
been possible had people been holding on to their machines for 3 - 5 years
before buying new ones.

~~~
diydsp
> ISA->VESA->PCI->AGP->PCIe

That's one way to look at it. Another is that we might have gone
ISA->PCI->PCIe in a shorter overall timeframe b/c there were no distractions
to get short-term stuff to market.

~~~
ChuckMcM
True, but the relative market size growth has an impact. Had we skipped VESA
and AGP, for example, there would be thousands more ISA slots (longer time in
market), and so the next generation, PCI in your example, would have to burn
more cash getting "into" the space.

The obvious principle here is that innovation happens more rapidly in a market
where there is a large demand for improvement and low friction for upgrades.
Pull back on either of those and it slows down the rate of innovation.

------
jcampbell1
Wear failure on SSDs is often a silly thing to worry about. The rated number of
write-out cycles is roughly 10,000, and writing a drive out in full takes
something like 4 hours, so it would take around 5 years of continuous writing
before a traditional HD shows an endurance advantage. In practice, you only
have to worry about it if the workload could never be considered for a
traditional HD anyway.
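
As a back-of-envelope check of that claim (the cycle count and write-out time
are the rough figures above, not measurements), in Python:

    erase_cycles = 10_000        # rated full-drive write cycles
    hours_per_full_write = 4     # time to write the whole drive out once
    hours = erase_cycles * hours_per_full_write
    print(hours / (24 * 365))    # -> ~4.6 years of non-stop writing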

When faced with the choice between a Fiesta and a Ferrari, there are many
reasons to choose one over the other, but it is ridiculous to say "I picked
the Fiesta because the Ferrari has a known tire problem when going 200+ MPH".

~~~
dillondf
Durability has steadily decreased as flash density has increased: 10000, then
5000, then 2000 erase cycles for standard MLC parts over the last few years.
I'm not sure what Samsung's new 3D process is rated at. At the same time, the
voltage regulation and comparators used on-chip have gotten a lot better,
making it easier to detect leaky cells, so reliability has significantly
improved for the erase cycles the flash does have.

The original Intel 40G SSDs could handle an average of 10000 erase cycles (for
each block of course), giving the 40G SSD around a 200TB minimum life if you
divide it out and then divide by 2 for safety. (Intel spec'd 20TB or 40TB or
something like that, 400TB @ 10000 erase cycles, divide by 2 gives you ~200TB
or so).

A modern 512G Crucial SSD sits somewhere around a 2000 erase cycle durability,
or around 512TB of relatively reliable wear (1PB / 2 for safety).
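
The same rough math in Python, using the estimates above (not datasheet
figures):

    def rated_endurance_tb(capacity_gb, erase_cycles, safety_factor=2):
        # capacity times rated erase cycles, halved for a safety margin
        return capacity_gb * erase_cycles / 1000 / safety_factor

    print(rated_endurance_tb(40, 10_000))   # old Intel 40G  -> 200.0 TB
    print(rated_endurance_tb(512, 2_000))   # 512G Crucial   -> 512.0 TB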

I would not necessarily trust an SSD all the way to the point where it says it
is completely worn out; I would likely replace it well before that point, or
when the Hardware_ECC_Recovered counter starts to look scary. But I would
certainly trust it at least through the 50% mark on the wear status. Remember
that shelf life degrades as wear increases. I don't know what that curve looks
like, but we can at least assume somewhere in the ballpark of ~10 years new,
and ~5 years at 50% wear. Below ~5 years I would start to worry (but
that's still better than a shelved HDD, which can go bad in 6 months and is
unlikely to last more than a year or two shelved).

-Matt

------
gojomo
Since it wasn't immediately clear to me how the percentages were pulled from
the smartctl output, I'll note that "Perc_Rated_Life_Used" is the relevant
readout.

(Some online sources mention "Wear_Leveling_Count", and even misreport that as
a percentage – but in fact that seems to be an absolute count of the number of
times each single block has been rewritten. The percentage is presumably this
Wear_Leveling_Count divided by the rated number of cycles.)
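
As an illustration only (both numbers are made up): if the flash were rated
for 3000 erase cycles and Wear_Leveling_Count reported 60, the life-used
percentage would presumably be

    wear_leveling_count = 60     # hypothetical SMART raw value
    rated_erase_cycles = 3000    # hypothetical rating for the drive's flash
    print(100 * wear_leveling_count / rated_erase_cycles)   # -> 2.0 (% used)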

------
dillondf
There are certainly workloads that will wear out an SSD, random database writes
being the most common. But that is only a very small portion of the storage
ecosystem, not to mention that there are plenty of ways to retool SQL backends
to not do random writes any more, particularly since it doesn't gain you
anything on a modern copy-on-write style filesystem versus indexing the new
record yourself as an append. So I expect this particular issue will take care
of itself in the future. It's a matter of not blindly using someone's database
backend and expecting it to be nice.
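
A minimal sketch of that append-instead-of-update idea (table and key names
are made up, and this only shows the access pattern, not how any particular
backend implements it):

    import sqlite3

    db = sqlite3.connect(":memory:")
    db.execute("""CREATE TABLE records (
        seq   INTEGER PRIMARY KEY AUTOINCREMENT,  -- monotonically increasing
        key   TEXT NOT NULL,
        value TEXT NOT NULL)""")
    db.execute("CREATE INDEX idx_key ON records(key, seq)")

    def put(key, value):
        # append a new version instead of rewriting the old row in place
        db.execute("INSERT INTO records (key, value) VALUES (?, ?)", (key, value))

    def get(key):
        # the newest version is the one with the highest sequence number
        row = db.execute("SELECT value FROM records WHERE key = ?"
                         " ORDER BY seq DESC LIMIT 1", (key,)).fetchone()
        return row[0] if row else None

    put("user:42", "alice")
    put("user:42", "alice-v2")   # supersedes the old row, no in-place update
    print(get("user:42"))        # -> alice-v2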

The vast majority of information stored these days is write-once-read-never,
followed by write-once-read-occasionally. I expected our developer box, which
is chock full of uncompressed crash dumps and many, many copies of the source
tree in various states, to have more wear on it than it did, but after thinking
about it a bit I realized that most of that data was
write-once-read-hardly-at-all.

In terms of hardware life, for servers there are only a few things which might
cut short a computer's life; otherwise it would easily last 50 years. (1)
Electrolytic capacitors. (2) Any spinning media or low frequency transformers.
(3) On-mobo flash or e2 that cannot be updated.

(1) Electrolytic capacitors have largely disappeared from motherboards in
favor of solid caps which, if not over-volted, should last 30 years.
Electrolytic caps are not sealed well and evaporate over time, as well as
slowly burn holes in the insulator. They generally go bad 10-30 years later
depending on how much you undervolt them (the more you undervolt, the longer
they last). Even so I still have boards with 30+ year old electrolytics in
them that work.

(2) Spinning media obviously has a limited life. That's what we are getting
rid of now :-). Low frequency transformers have mostly gone away. Transformers
in general... anything with windings, that is... have a limited life due to the
wire insulation breaking down over time, but most modern use cases in a
computer (if there are any at all) likely have huge margins of error.

(3) Firmware stored in E2 and flash, or OTP eprom, rather than fuse-based
proms, will become corrupt over time. 10 years is a minimum life, 20-30 years
is more common. It depends on a number of factors.

Other than that there isn't much left that can go bad. All motherboards these
days have micro coatings which effectively seal even the chip leads, so
corrosion isn't as big a factor as it was 20 years ago. The actual chip logic
basically will not fail, and since the high-speed clocks on the whole mobo can
be controlled, aging effects which degrade junction performance for most of
the chip can be mitigated. I suppose an ethernet port might go bad if it gets
hit by lightning but I've never had a mobo ethernet go bad in my life. Switch
ethernet ports going bad is usually just due to poor parts selection or
overvolting which would not be present in a colocation facility or machine
room.

In any case, there is no reason under the sun that a modern computer with an
SSD wouldn't last 30 years with only fan, real-time clock battery, and PSU
replacements.

-Matt

~~~
mato
> I suppose an ethernet port might go bad if it gets hit by lightning but I've
> never had a mobo ethernet go bad in my life.

Been there, record thunderstorm centered directly above a client's building.
Several switch and mobo ports died. The fact that said client "saved" on
cabling by running UTP between buildings probably had something to do with it.

~~~
dillondf
Well, I probably don't have to tell people to never run copper between
buildings. Always run fiber. The longer the runs, the higher the common mode
voltage between grounds and the higher the stress on the isolation circuitry.
Plus lightning strikes don't have to hit the cable directly; they can pull up
the ground for the whole building and suddenly instead of having 200V of
common mode you have 4000V for a few milliseconds.

Anyone who has ever wired up a T1 in the basement of a highrise knows what I
mean.

-Matt

------
kabdib
Not sure why you'd want to buy 10-year capable storage for a server.

I buy for reliability. If I can plug something in and just forget about it
(except for patches) for 3-4 years, we're done and whatever new thing I can
buy will pay for itself in power savings.

I'm happy that an SSD will last that long, but it's not something I worry
about.

I _have_ had to rebuild servers because of abject SSD failure, and will no
longer buy those brands. Failure seems to be quite highly correlated with
trying to save a buck by getting non-top-tier drives (whereupon it's me, or
someone like me, picking up the pieces when I could be writing code. Screw
that).

If a drive was happily at 90% wear after two years, I'd just provision more of
the same drive. Yay! :-)

~~~
dillondf
Basically only buy SSD brands who either are chip fabs or have a relationship
with a single chip fab. So. Intel, Crucial, Samsung, maybe one or two others.
And that's it. And frankly I have to say that only Intel and Crucial have been
really proactive about engineered fixes for failure cases. Never buy SSDs from
second-tier vendors such as Kingston who always use the cheapest third-party
flash chips they can find. There are literally dozens of those sorts of
vendors. Hundreds, even.

-Matt

~~~
gonzo
Micron is the name behind Crucial.

Kingston has gotten very good (and they used to suck) at least in the eMMC
space (not quite a SSD, but close).

~~~
kabdib
Crucial are the drives I've been ripping out. Enterprise class. They just died
after 2 years of service and not really all that much wear; they won't talk to
the outside world anymore.

I nailed one to the wall of our IT lab as a warning not to buy any more of
them.

------
mrmondo
I'm currently working on a project to replace our older SANs with SSD-only
storage servers - I've performed a few POCs with great results and am now
documenting the build as I go.

Not only is it going to give us some much-needed IOPS, it's also going to save
us hundreds of thousands of dollars on storage over the next 3 years.

If you're interested take a look at: http://smcleod.net/building-a-high-performance-ssd-san

------
unluckier
None of those drives show evidence of regularly-scheduled self tests. They
must not really care about them.

~~~
wtallis
What sort of self-testing do you think is needed?

~~~
unluckier
Daily short tests and weekly long tests seem reasonable. These drives show
that they've gone thousands of hours since the last test.
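
One simple way to get that schedule, assuming smartmontools is installed
(device names and paths will vary), is a pair of crontab entries:

    # daily short self-test at 02:00
    0 2 * * *  smartctl -t short /dev/ada0
    # weekly long self-test, Saturdays at 03:00
    0 3 * * 6  smartctl -t long /dev/ada0

The results then show up in the drive's self-test log (smartctl -l selftest).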

~~~
dillondf
It's really unclear whether explicitly initiated tests actually help for an
SSD. The SSD has its own internal mechanisms to scan the flash chips, which (I
presume) are unrelated to SMART, since they are required for normal
operation... primarily detecting weak cells before the bits actually go bad.
Whole chips can go bad out of the box, but after an SSD has been running for a
few months there isn't much left that can go south other than normal wear or a
firmware failure.

Firmware issues dominated SSD problems in the early years. Those issues are
far less prevalent today, though Samsung got its ass bitten by not adding
enough redundancy and having data go bad on some of its newer offerings.
Strictly a firmware issue. Which is another way of saying that one should
always buy the not-so-bleeding-edge technologies that have had the kinks
worked out of them rather than the bleeding-edge technologies that haven't.

If it starts to bite me I may change my tune. But until that happens I put it
in the wasted-cycles category.

-Matt

