What sucks, however, is that you can transfer 1TB to a USB drive and then a single bit error near the end spoils the entire transfer and might even corrupt the drive. This happened to me several times during testing, on different machines and with different cables and drives (luckily no real data was lost, but USB is now more or less banned from my office for use with external hard drives). Any sane protocol would be able to cope with such a low error rate and finish the transfer without problems.
Use a filesystem that does end-to-end error detection and correction. I use ZFS on all of my USB drives (even "real" SATA and M.2 SSDs via adapters), and it frequently finds checksum errors (and, thanks to its design, can recover completely without any risk of further corruption).
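For what it's worth, the principle is simple enough to sketch in a few lines of Python: checksum every block at the top of the stack and keep a redundant copy, so a silent flip lower down is both detected and repaired. This is only a toy illustration (the ChecksummedStore class and its in-memory "devices" are invented for the example), not how ZFS is actually implemented:

    import hashlib

    class ChecksummedStore:
        def __init__(self):
            self.replicas = [{}, {}]   # two in-memory "devices" holding block copies
            self.checksums = {}        # block id -> expected SHA-256 digest

        def write(self, block_id, data: bytes):
            # The checksum is computed at the top of the stack, so corruption
            # introduced anywhere below (cable, controller, media) is detectable.
            self.checksums[block_id] = hashlib.sha256(data).digest()
            for dev in self.replicas:
                dev[block_id] = data

        def read(self, block_id) -> bytes:
            expected = self.checksums[block_id]
            for dev in self.replicas:
                data = dev[block_id]
                if hashlib.sha256(data).digest() == expected:
                    return data        # first copy that still matches wins
            raise IOError(f"block {block_id}: every copy fails its checksum")

    store = ChecksummedStore()
    store.write(0, b"important data")
    store.replicas[0][0] = b"important dbta"   # simulate a silent bit flip on one device
    assert store.read(0) == b"important data"  # the intact copy is returned instead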
Isn't any filesystem that runs on top of an SSD or flash drive at risk of the SSD/flash controller corrupting the disk or making it inaccessible when the volume is not cleanly unmounted?
I can remember losing a USB drive that way, even though it was using a journaling filesystem.
Potentially yes, realistically no. This is the old "if the storage controller is lying about flushing to disk, there's not a lot you can do about it" problem.
ZFS does have defensive mechanisms, like doing a read after write to check that what was written is actually committed to disk (a sketch of that check follows below), but if the storage controller chooses to serve that read out of its cache, then that could be a lie too. It's the old "trusting trust" predicament: there's no way for the layers higher in the stack to prove that the lower levels aren't simply a tower of lies, only instead of viruses it's flush.
That said, in practice very little hardware is actually a giant tower of lies. Flash drives typically don't have enough of a controller to cache anything, and thus no real capability to lie about writes. SSDs do, but they also generally honor flush commands rather than just acknowledging them.
RAID controllers and the like are the danger zone: they may have a cache and may lie to the CPU about it on the assumption that it will eventually be flushed anyway, and that's the dangerous part.
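To make the read-after-write idea concrete, here's a rough application-level sketch in Python (the function name and the posix_fadvise hint are my own choices for the example, not anything ZFS-specific), and it runs straight into the limitation described above: the device can still answer the read from its own cache.

    import os

    def write_and_verify(path: str, data: bytes) -> bool:
        # Write the data and ask every layer below to commit it to stable storage.
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
        try:
            os.write(fd, data)
            os.fsync(fd)
        finally:
            os.close(fd)

        # Read it back. posix_fadvise only drops the kernel's cached pages; the
        # drive's own controller can still serve the read from its internal cache,
        # which is exactly the "tower of lies" problem.
        fd = os.open(path, os.O_RDONLY)
        try:
            if hasattr(os, "posix_fadvise"):  # Linux/Unix only
                os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
            readback = os.read(fd, len(data))
        finally:
            os.close(fd)

        return readback == data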
I think the person you are replying to understands this, but the point is that "the last few writes" may be executed in a different order by the underlying hardware, and that might confuse the filesystem.
That's great if you only use these drives with your big computers. Sadly, if you want a "universally" portable drive – usable on an Android phone, an iPad, and whatnot – you're stuck with horribly basic filesystems like exFAT :(
Are you sure it's a USB problem and not something else? Did you control for all the other variables - i.e., how scientific were you in determining that USB was the problem rather than something else?
I think I tested quite thoroughly, on very recent Linux versions, with rsync. Of course it is difficult to pinpoint the real culprit with certainty, but USB has no error correction, and that makes it the prime suspect here.
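For anyone who wants to reproduce that kind of test, a minimal sketch of the end-to-end check (independent of rsync's own --checksum mode) is to hash both trees and diff the digests; the two mount paths below are placeholders for whatever source and copy you compare:

    import hashlib
    from pathlib import Path

    def tree_digests(root: str) -> dict:
        # SHA-256 of every file under root, keyed by its path relative to root.
        digests = {}
        for path in sorted(Path(root).rglob("*")):
            if path.is_file():
                h = hashlib.sha256()
                with open(path, "rb") as f:
                    for chunk in iter(lambda: f.read(1 << 20), b""):
                        h.update(chunk)
                digests[str(path.relative_to(root))] = h.hexdigest()
        return digests

    src = tree_digests("/mnt/source")    # placeholder paths
    dst = tree_digests("/mnt/usb-copy")
    bad = [p for p in src if dst.get(p) != src[p]]
    print(bad if bad else "all files match")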
I'm old enough to remember parity bits and FEC, so I know that the number of bits per data byte might be 8, but the number of bits required to transmit a single byte might be more than 8, and the number of bytes needed to transmit a frame of data might also be higher than the number of bytes of payload within the frame.
I know next to nothing about modern serial protocols, but nevertheless as a rule of thumb, and absent any other information, I tend to use 10 bits per byte when converting bits per second to bytes per second.
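As a worked example of that rule of thumb (the signalling rates below are the nominal USB figures, and real throughput is lower still because of protocol framing):

    # "10 bits per byte" rule of thumb for turning a line rate into a payload ceiling.
    def bytes_per_second(line_rate_bps: float, bits_per_byte: int = 10) -> float:
        return line_rate_bps / bits_per_byte

    print(bytes_per_second(480e6) / 1e6)   # USB 2.0, 480 Mbit/s -> ~48 MB/s ceiling
    print(bytes_per_second(5e9) / 1e6)     # USB 3.0, 5 Gbit/s   -> ~500 MB/s ceiling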
Recent standard revisions (USB 3.1 Gen 2, as well as USB 3.2 Gen 2x2) switch to a more efficient 128b/132b encoding (https://en.wikipedia.org/wiki/64b/66b_encoding), with ~3% encoding overhead rather than the historical 25%.
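The efficiency gap is easy to verify from the definitions alone (taking "overhead" to mean extra bits per payload bit, the convention that makes 8b/10b come out at 25%):

    def overhead(payload_bits: int, total_bits: int) -> float:
        return (total_bits - payload_bits) / payload_bits

    print(f"8b/10b:    {overhead(8, 10):.1%}")     # 25.0%
    print(f"128b/132b: {overhead(128, 132):.2%}")  # 3.12%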
A byte has been synonymous with an octet for as long as I can remember. I found a Wikipedia article [1] that describes how "byte" shifted from meaning the number of bits used to encode a character of text to meaning an octet.
A byte also refers to the smallest addressable unit. In most cases that is 8 bits, but many (I guess mostly niche) architectures have much larger characters/bytes, and that meaning is still very much relevant.
Why? Treating a byte as 10 bits instead of 8 would actually make the drive capacity lower in terms of bytes (not physically, of course, but for marketing), which a vendor probably doesn't want.
Yeah, that's the kibibytes vs kilobytes / gibibytes vs gigabytes difference...
Which adds up to a ~7.4% difference in capacity when talking about GiB/GB, and as much as ~12.6% when talking about PiB/PB.
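A quick check of those figures from the prefix definitions themselves:

    # Binary prefix divided by decimal prefix, minus one, gives the capacity gap.
    for name, binary, decimal in [("GiB vs GB", 2**30, 10**9),
                                  ("TiB vs TB", 2**40, 10**12),
                                  ("PiB vs PB", 2**50, 10**15)]:
        print(f"{name}: {binary / decimal - 1:.1%}")
    # GiB vs GB: 7.4%, TiB vs TB: 10.0%, PiB vs PB: 12.6%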
Wow, 90% overhead?