
New hard drive rituals (2018) - walterbell
https://blog.linuxserver.io/2018/10/29/new-hard-drive-rituals/
======
mceachen
This article has an unfortunate omission: it ignores drives' S.M.A.R.T.
reporting system, which can give a surprisingly thorough view into drive
metadata, like how many blocks have been remapped due to media corruption.

After running `badblocks`, run `smartctl -t long $DEVICE`, which is a "long"
self-test. This will take upwards of a day to complete, depending on the
drive.

After that completes, check the status by running `smartctl --all $DEVICE`,
and verify that there are only zeroes in the RAW_VALUE column for
Reallocated_Sector_Ct, Used_Rsvd_Blk_Cnt_Tot, and Uncorrectable_Error_Cnt.
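
That check is easy to script. A sketch (the table layout follows smartmontools' usual attribute output; the sample text below is made up for illustration, not from a real drive):

```python
# Sketch: check the RAW_VALUE column of `smartctl --all $DEVICE` output for
# the attributes named above. The sample table is illustrative only.
WATCHED = {"Reallocated_Sector_Ct", "Used_Rsvd_Blk_Cnt_Tot", "Uncorrectable_Error_Cnt"}

def nonzero_watched_attrs(smartctl_output):
    """Return {attribute_name: raw_value} for watched attributes whose RAW_VALUE != 0."""
    bad = {}
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID#; RAW_VALUE is the 10th column.
        if len(fields) >= 10 and fields[0].isdigit() and fields[1] in WATCHED:
            try:
                raw = int(fields[9])
            except ValueError:   # some drives report composite raw values; skip those
                continue
            if raw != 0:
                bad[fields[1]] = raw
    return bad

sample = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
179 Used_Rsvd_Blk_Cnt_Tot   0x0013   100   100   010    Pre-fail  Always       -       0
187 Uncorrectable_Error_Cnt 0x0032   100   100   000    Old_age   Always       -       3
"""

print(nonzero_watched_attrs(sample))  # {'Uncorrectable_Error_Cnt': 3}
```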

~~~
rsync
It's also very illuminating to check SMART stats on "new" drives that show
hundreds of on/off cycles or other evidence that the drives are not, in fact,
new.

Pro tip: Many, many drives sold on Amazon by large, "reputable" suppliers are
not new in the way you think they are - they are "new pulls" from systems
integrators that received them as part of the complete PC and then pulled
them, en masse, to sell in bulk.

Those "new pulls" may have hundreds and hundreds of hours on them ...

We (rsync.net) have started trending towards B&H Photo for larger batches of
drives ... they pack them cluefully and there is no "new pull" monkey-business
... unfortunately they don't always have the quantity on-hand that we require,
but that's not a knock against them ...

~~~
tomc1985
Wow, so I guess people are buying new PCs and essentially scrapping them for
parts? That's....... actually kind of funny

~~~
hakfoo
When I bought my present laptop, it was about $200 cheaper to order the min-
spec (4G RAM/500G hard disc) and add aftermarket RAM and SSD upgrades than to
buy it the way I wanted from the maker.

I could imagine that a firm which ordered 1000 units and did (or contracted out
for) the same upgrades now has 1000 unneeded 500-gig hard drives it can unload
cheap.

------
jolmg
On the topic of new rituals for storage devices, whenever I buy a microSD card
or a USB flash drive, before I set up the filesystem I fill it with random
data and read it back, to ensure it really has the storage capacity it's
supposed to.

I started doing this after reading comments online (probably here) about
fraudulent cards and drives circulating on Amazon. Some comments mentioned
that there are even devices whose firmware is set up so that writes don't fail
even once you've written more than the device can actually hold. They
supposedly manage this by reusing previously used blocks for new files; in
other words, each physical block answers to multiple addresses, and the
addresses probably loop around the same blocks until the fake size is reached.
That gives the most realistic appearance of being the reported size while
silently corrupting previously written data.
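
That wraparound behavior is easy to model. A toy sketch of the hypothesized firmware (the modulo mapping and sizes are assumptions for illustration; real fakes vary):

```python
# Toy model of a fake-capacity flash drive: it advertises ADVERTISED blocks
# but only has REAL physical blocks, silently mapping address a to a % REAL.
REAL, ADVERTISED, BLOCK = 4, 16, 512

physical = [bytes(BLOCK) for _ in range(REAL)]

def write(addr, data):
    physical[addr % REAL] = data   # writes past the real capacity wrap around

def read(addr):
    return physical[addr % REAL]

# Fill the whole advertised range with distinct data; every write "succeeds".
for a in range(ADVERTISED):
    write(a, bytes([a]) * BLOCK)

# Reading back reveals the fraud: block 0 now holds what was written to block 12.
print(read(0) == bytes([12]) * BLOCK)  # True
corrupted = sum(read(a) != bytes([a]) * BLOCK for a in range(ADVERTISED))
print(corrupted)  # 12 of the 16 advertised blocks come back wrong
```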

In any case, so far I've only gotten 4 cards/drives since then, and I haven't
found them to be frauds.

I've since wondered how much of the product lifespan I'm subtracting with my
testing. I imagine that microSD cards and USB flash drives do wear leveling
across unused blocks, but without TRIM support, after testing I've probably
crippled the device's ability to wear-level: I'm occupying all blocks with no
way to tell the device that they're unoccupied again.

As far as I understand, the problem with lacking TRIM support on USB flash
drives is not that there's no protocol for it through USB, since there are
USB-SATA adapters that specify support for UASP. Admittedly, though, I'm
assuming that UASP is enough to get TRIM support. I don't remember testing
that.

In any case, I wonder if this new ritual I'm doing is really worth it, and if
it's not, when would it be?

EDIT: I wonder if it's realistic to expect that one day we'll get USB flash
drives and microSD cards with TRIM support. I don't expect the need to check
for fraudulent storage devices to go away.

~~~
tialaramex
Filling with random data is advisable anyway if you use encryption for
storage, which almost everybody should (it's clearly fine to store genuinely
public data unencrypted, but if you're not sure, probably not everything is
public, so just encrypt everything).

The reason to write random data is that otherwise an adversary inspecting the
raw device can determine how much you really stored in the encrypted volume,
as the chance of a block being full of zeroes _after_ encryption is negligible
so such blocks are invariably just untouched.
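
That inspection is trivial to reproduce. A sketch that measures the all-zero fraction of a volume (run here against an ordinary file standing in for the raw device):

```python
def zero_fraction(path, block=4096):
    """Fraction of blocks that are entirely zero: on a volume that was not
    pre-filled with random data, untouched regions stand out exactly this way."""
    zero = total = 0
    with open(path, "rb") as f:
        while chunk := f.read(block):
            total += 1
            zero += chunk == bytes(len(chunk))
    return zero / total if total else 0.0

# Demo: a 1 MiB "volume" where only the first quarter was ever written.
with open("/tmp/demo-volume", "wb") as f:
    f.write(b"\xa5" * (256 * 1024) + bytes(768 * 1024))
print(zero_fraction("/tmp/demo-volume"))  # 0.75
```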

The encryption will ensure that random bits and encrypted data are
indistinguishable (in practice; in theory this isn't a requirement of notions
like IND-CPA, it just happens to be what you get anyway).

~~~
zamadatix
What's the cause for concern on someone knowing how much of the drive is
unused?

~~~
jmt_
I believe that could be used to determine if encrypted hidden volumes are in
use

------
rkagerer
Back around 2010 I got a great deal on eight PACER SSDs from China. At the
time, SSDs were still quite expensive compared to spinning platters.

My performance benchmarks were fantastic, but I started seeing some corruption
issues. I wrote a C# program to fill the entire disk up with pseudorandom
bytes (starting with a known seed), then read it back and compare.

Nearly all of them gave back different bytes at some point in the test. These
were silent errors, and didn't trigger any read warnings or CRC/ECC failures
reported by the drive's controller.

I suspect they achieved the performance by simply ignoring any errors and
steamrolling right along.
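
That test is easy to re-create. A sketch in Python, run here against an ordinary file (pointing it at a raw device is left to the reader; the seed and chunk size are arbitrary):

```python
import random

CHUNK = 1 << 20  # 1 MiB

def fill(path, size, seed=42):
    """Write `size` bytes of seeded pseudorandom data."""
    rng = random.Random(seed)
    with open(path, "wb") as f:
        remaining = size
        while remaining:
            n = min(CHUNK, remaining)
            f.write(rng.randbytes(n))
            remaining -= n

def verify(path, size, seed=42):
    """Regenerate the same stream from the seed and compare chunk by chunk;
    return the (chunk-aligned) offsets that read back different."""
    rng = random.Random(seed)
    bad = []
    with open(path, "rb") as f:
        offset = 0
        while offset < size:
            n = min(CHUNK, size - offset)
            if f.read(n) != rng.randbytes(n):
                bad.append(offset)
            offset += n
    return bad

fill("/tmp/testfile", 4 * CHUNK)
print(verify("/tmp/testfile", 4 * CHUNK))  # [] means every byte came back intact
```

The seed is the trick: you never store the reference data, you just regenerate it for the comparison pass.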

~~~
jojo2000
Bad blocks bad blocks what ya gonna do, what ya gonna do when they come for
you.

Hard drives fail hard or smoothly; SSDs fail hard! On a hard drive, we can
picture how the blocks are stored (sequentially). On an SSD, there is a
remapping of blocks in silicon+firmware, and this added layer adds its own
bugs [0] [1].

[0] [https://blog.elcomsoft.com/2019/01/why-ssds-die-a-sudden-death-and-how-to-deal-with-it/](https://blog.elcomsoft.com/2019/01/why-ssds-die-a-sudden-death-and-how-to-deal-with-it/)

[1] [https://www.anandtech.com/show/6503/second-update-on-samsung-ssd-840840-pro-failures](https://www.anandtech.com/show/6503/second-update-on-samsung-ssd-840840-pro-failures)

(edit: added one more reference)

------
rini17
Badblocks writes easily compressible patterns, and I assume hard drive
manufacturers try to compress and deduplicate data, masking any defects,
especially on SSDs and SMR drives. So I prefer to fill the drive with
AES-encrypted data, like:

yes | openssl enc -aes-256-cbc -pass pass:testkey -out /dev/test-drive

and then decrypt and grep to check that only "y" lines come back. It's almost
as fast as badblocks.

~~~
JensRex
Per recent postings on HN about /dev/urandom, wouldn't it be fine to use data
from that, rather than bothering with AES output? /dev/urandom should be
cryptographically safe.

~~~
jpdaigle
But how would you detect corruption in this case?

If you write random bytes then read them back, you have nothing to compare
them to.

~~~
tzs
Let b be the number of bytes per block, so each block can hold b/16 of the
128-bit (16-byte) hashes.

1. Test N blocks of the disk, where N is small enough that you can keep a
128-bit hash of each block of random test data in memory for the duration of
the test. You can use the hashes to verify that read data is correct.

2. Then test another Nb/16 blocks, using the already-verified N blocks from #1
to store the 128-bit hashes of the new test blocks.

3. Then test (N + Nb/16)(b/16) more blocks, using the blocks from #1 and #2 to
hold the hashes.

4. Continue the same way: at each stage, every previously tested block becomes
hash storage for b/16 new blocks.

...

(Maybe fit only 15 data-block hashes per hash block, plus a hash of those 15,
so you can check that the hash blocks themselves are reading back fine?)
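
A sketch of stages #1 and #2 against a tiny in-memory "disk" (blake2b truncated to 16 bytes stands in for the 128-bit hash; the block size and counts are toy values):

```python
import hashlib, random

B = 64                 # bytes per block (toy value)
N = 4                  # blocks whose hashes we can keep in memory
HASH = 16              # 128-bit hashes
PER_BLOCK = B // HASH  # hashes one verified block can store

rng = random.Random(0)
disk = {}              # block number -> contents; stands in for the device

def h(data):
    return hashlib.blake2b(data, digest_size=HASH).digest()

# Stage 1: write N random blocks, keeping their hashes in memory, and verify.
in_memory = []
for blk in range(N):
    disk[blk] = rng.randbytes(B)
    in_memory.append(h(disk[blk]))
assert all(h(disk[blk]) == in_memory[blk] for blk in range(N))

# Stage 2: test N*(B/HASH) more blocks, storing their hashes in the
# already-verified stage-1 blocks instead of in RAM.
stage2 = range(N, N + N * PER_BLOCK)
stage2_hashes = []
for blk in stage2:
    disk[blk] = rng.randbytes(B)
    stage2_hashes.append(h(disk[blk]))

packed = b"".join(stage2_hashes)           # exactly fills the N stage-1 blocks
for i in range(N):
    disk[i] = packed[i * B:(i + 1) * B]

# Verify stage 2 by reading the hashes back off the "disk" itself.
stored = b"".join(disk[i] for i in range(N))
ok = all(stored[j * HASH:(j + 1) * HASH] == h(disk[blk])
         for j, blk in enumerate(stage2))
print(ok)  # True
```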

~~~
jeremysalwen
So basically you are demonstrating that the method is more complex than OPs,
with absolutely no benefit, which I think was the point.

------
cmurf
For badblocks, I suggest a block size larger than 4 KiB; it'll go faster. The
legacy reason for 4 KiB is that badblocks can produce a bad-block list that
mke2fs can use to avoid known bad sectors, but then the block size for
badblocks and mkfs must match. Since bad sectors are managed by the drive
firmware these days, you're best off not using this legacy feature, so you can
raise the badblocks block size well above 4 KiB.

Another idea for a generic approach, whether HDD or SSD, is to use f3. It'll
check for fake flash (firmware loops to report a bigger device size than real
size), corruptions, and read/write errors.

~~~
spartas
[https://unix.stackexchange.com/questions/202541/how-can-i-choose-optimal-values-for-block-size-and-blocks-number-for-badblocks](https://unix.stackexchange.com/questions/202541/how-can-i-choose-optimal-values-for-block-size-and-blocks-number-for-badblocks)

------
nickcw
Here is a little program I wrote to stress test disks. It was originally
written for HDD but it works well on SSD and Flash too. I test all my media
with it for 24 hours. In fact I'm testing two 5TB external Seagate drives as I
type!

[https://github.com/ncw/stressdisk](https://github.com/ncw/stressdisk)

~~~
mceachen
Well done. Thanks for sharing your efforts!

------
userbinator
My preferred software for scanning a newly purchased disk is MHDD:

[https://hddguru.com/software/2005.10.02-MHDD/](https://hddguru.com/software/2005.10.02-MHDD/)

Because it times how long each sector takes to read/write, it has the
advantage of being able to detect "weak" sectors before they even become bad
ones. HDDs will retry an access multiple times before giving up, so to a
program that simply reads and writes data, weak sectors that eventually read
successfully after a few retries aren't noticeable. They do show up in MHDD as
taking longer than usual to read.
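
MHDD is standalone DOS-era software, but the timing idea itself is simple. A sketch (run against an ordinary file here; the 50 ms threshold is an arbitrary illustration, and on a real device you'd open it with caching bypassed):

```python
import time

def scan_latency(path, block=1 << 16, slow_ms=50.0):
    """Read the target block by block, timing each read; return a list of
    (offset, milliseconds) for reads slower than the threshold."""
    slow = []
    with open(path, "rb") as f:
        offset = 0
        while True:
            t0 = time.perf_counter()
            data = f.read(block)
            elapsed_ms = (time.perf_counter() - t0) * 1000
            if not data:
                break
            if elapsed_ms > slow_ms:
                slow.append((offset, elapsed_ms))
            offset += len(data)
    return slow

# Demo against a small file; a healthy, cached target normally reports nothing.
with open("/tmp/latency-demo", "wb") as f:
    f.write(bytes(1 << 20))
print(scan_latency("/tmp/latency-demo"))
```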

------
zepearl
I've been thinking for a while about replacing mdadm's raid5 with ZFS's RAIDZ
in my two NASes:

if I understood correctly, in the case of single/few block failures, ZFS would
transparently remap the bad block(s) on the failing HDD to a good block on the
same HDD (therefore without having to immediately replace the partially faulty
HDD and rebuild the whole raid) => is this correct?

I admit that continuing to use the partially faulty HDD might be a bad idea,
but at least having the raid 100% healthy before replacing the partially
faulty HDD gives me a good feeling.

~~~
dijit
It’s pretty standard procedure for a drive to transparently stop using known
bad blocks on its own, though.

These show up in S.M.A.R.T. as reallocated sectors; I guess it's a little
similar to overprovisioning and wear in SSDs.

~~~
zepearl
Aha! I wasn't aware of that at all! Thank you!

------
grizzles
25 in the last 10 years. Must be nice. I buy that many each year, and I often
get new drives that can't detect their own errors, errors that ZFS finds; a
faulty device, by definition, can't be trusted to test itself.

SMART doesn't work anymore, if it ever did. Manufacturer firmware blobs have a
huge incentive to report a successful test no matter what. Also, as someone
else mentioned, there are plenty of used drives being resold as new in the
market.

This industry has skated by for way too long, in my opinion. It benefits from
a big reluctance on the part of people and companies to return faulty drives,
for fear of exposing sensitive data.

I have seriously considered starting an ecommerce shop that sells only hard
drives. But also does backblaze like tracking / reporting of individual
drives, has a good return policy and degausses returned drives.

Speaking as a consumer, I have zero interest in the current system; I want one
that lets me take the TCO of a drive or drive class into account before I make
a purchasing decision.

------
thanksforfish
I'm curious about what prevents the manufacturers from running badblocks and
eliminating some of the downstream pain. Obviously some users care and will
spend the effort to check their drives, and some unlucky users will have a
data loss and a bad experience.

Is not running badblocks before shipping a way to exploit users who won't RMA
or won't realize their drive is busted? Or are drives getting damaged in
transit? Or something else?

------
maccam94
I recommend using the Flexible I/O tester (fio) to burn in new disks:
[https://fio.readthedocs.io/en/latest/fio_doc.html#running-fio](https://fio.readthedocs.io/en/latest/fio_doc.html#running-fio)

It will give you both performance statistics and catch checksum verification
errors.

------
fred_is_fred
There are at least two comments in here about special processes people use to
validate Amazon purchases. This is because you cannot trust that what you are
buying is new and because you cannot trust that it is the advertised size. If
this is true, why not just buy computer equipment somewhere else?

------
proactivesvcs
For hard disk drives, I leave the drive aside for 24 hours before powering it
up if it's likely to have been in below-zero temperatures before it got to me.

I run a short SMART test and check the results, as well as the attributes, in
case it's faulty already. Then I do much the same as the OP, afterwards
running a short, a conveyance, and a long test.

~~~
dijit
> For hard disk drives I leave the drive aside for 24 hours, before powering
> it up, if it's likely to have been in below-zero temperatures before it got
> to me.

Regarding temperatures: it used to be somewhat common practice to freeze
drives that were clicking.

Funny to hear of someone waiting for them to thaw, when I have intentionally
frozen a bunch of drives before.

~~~
proactivesvcs
The way I figure: they're fragile as it is, so if I can help make their life
easier, I do :-)

------
HenryBemis
On every new hard drive I always run SpinRite when I hook it up to my
machine(s). I know that its repair mode is sloooooooooow, but Steve Gibson is
working on a newer version that should significantly improve speed. (Not
affiliated; it has just saved my butt a couple of times.)

------
thatoneguy
Ritual I've had since at least 1994: naming a new hard drive volume under MS-
DOS/Windows "PLEASE_WORK"

------
3JPLW
I'm curious about the efficacy here; how many "bad" drives will this find?

~~~
mceachen
I've found several in the past couple years (I test them more rigorously
before I put them into a NAS).

------
numpad0
Would it work with SMR drives?

------
RobertRoberts
[Wrote a comment on the wrong thread... sorry]

~~~
yjftsjthsd-h
Are you on the right thread?

~~~
RobertRoberts
haha, nope. Ty

