

Many SSD Benchmark Reviews Contain Flaws - mrb
http://blog.zorinaq.com/?e=29

======
alabut
What would've been great is if instead of just pointing out the flaws in
testing methodologies, the OP picked out a few publications that do run the
tests accurately. That way we could judge how seriously to take some of the
claims of the X25-M successors, like the X25-E Extreme.

~~~
mrb
Anandtech seems to stand out from the pack.

For example, they are the only ones who seem to benchmark SandForce-based SSDs
(from OCZ, Crucial, ADATA, etc) correctly, taking care to defeat the
transparent compression and deduplication features of this controller. They
did so by patching IOMeter to write blocks of random bytes instead of constant
bytes.
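
As a rough illustration of why this matters (a sketch of my own, not
Anandtech's actual methodology), constant-byte buffers are nearly free for a
compressing controller to store, while random bytes are incompressible:

```python
import os
import zlib

# A SandForce-style controller compresses data transparently before writing it
# to flash. A benchmark that writes constant bytes therefore measures the
# controller's compressor, not the NAND; random bytes defeat the compression.
BLOCK = 4096

constant = b"\x00" * BLOCK       # what many unpatched tools write
random_data = os.urandom(BLOCK)  # what a patched IOMeter writes

print(len(zlib.compress(constant)))     # a few dozen bytes: almost free to store
print(len(zlib.compress(random_data)))  # roughly 4 kB: incompressible
```

zlib stands in for the controller's (undisclosed) compression algorithm here;
the asymmetry is what matters, not the exact ratios.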

~~~
avar
On the other hand, the SSD is probably doing transparent compression and de-
duplication because it helps with real-world workloads.

If you only test the pathological case, the results aren't informative when
your real workloads could benefit from compression and de-duplication.

~~~
mrb
Anandtech takes care to test both cases, 100% random data and non-random
data, to give a range of expected performance on real-world workloads.

However, they only do this for tests run with IOMeter, because most other
benchmarking tools are unable to write random data. Which brings me back to
one of my points: many benchmarking tools are flawed in the sense that they
don't take potential deduplication into account.
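
To make the dedup point concrete, here is a small illustrative sketch (my
own, not from any benchmarking tool): a tool that generates one random buffer
and reuses it for every write still produces fully dedupable data, so each
written block must be freshly generated.

```python
import hashlib
import os

BLOCK = 4096

# Reusing a single random buffer for every write: the data is random, but a
# dedup-capable controller only needs to store the block once.
reused = [os.urandom(BLOCK)] * 8

# Freshly generated data for every write: nothing to deduplicate.
unique = [os.urandom(BLOCK) for _ in range(8)]

def distinct_blocks(blocks):
    """Number of distinct 4 kB blocks, as a dedup engine would see them."""
    return len({hashlib.sha256(b).digest() for b in blocks})

print(distinct_blocks(reused))  # 1
print(distinct_blocks(unique))  # 8
```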

------
illumin8
Great article. I would be interested in more detail on how you specifically
benchmark storage. I benchmark a lot of enterprise SAN storage for Oracle
database clusters and I've found a tool called ORION that works surprisingly
well. It tests small random I/Os, large sequential I/Os, or any combination
thereof, and uses the Oracle asynchronous I/O libraries so you can simulate a
real database workload.

The nice thing is that I can plug in my workload mix (90% reads, 10% writes),
point it at all of my raw devices, and it will test all combinations of small
random and large sequential I/Os at multiple queue depths, and give me CSV
output showing my I/Os per second, MB per second, and latency for every data
point.

You can also tell it the size of your storage array's or disk's cache, which
it adds to the Linux kernel cache size, and it pre-warms the caches with data
from /dev/random. For example, if I'm testing on a server with 64GB of RAM and
8GB of storage cache, it will pre-warm a total of the Linux kernel cache plus
8GB, so it might pre-warm 60-70GB of data before each data point to get
accurate, repeatable results without the effects of cached reads/writes.

~~~
mrb
I use scripts I wrote myself in combination with Linux CLI tools (dd, iostat).
They give me 6 pieces of information: sustained sequential read and write
throughput (at different LBA offsets), read and write latency (4kB I/O with
queue depth of 1), and random read and write IOPS (4kB I/O with queue depth of
at least 32, and, if necessary, using random data to characterize the
effectiveness of any dedup/compression feature).

That's it. Once I have these 6 synthetic results, I have more and better
information than 9 out of 10 benchmarking articles on the Net.

I usually don't need more complex synthetic benchmark results (e.g. x% reads /
y% writes). After that I simply test the real workload.
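
A bare-bones version of the 4kB queue-depth-1 random read measurement could
look like this in Python (a sketch under assumptions: `testfile.bin` is a
hypothetical scratch file, and without O_DIRECT against a raw device these
reads mostly hit the Linux page cache, so real SSD numbers need the dd/iostat
approach, or O_DIRECT with aligned buffers):

```python
import os
import random
import time

PATH = "testfile.bin"  # hypothetical scratch file; use a raw device for real numbers
BLOCK = 4096
IOS = 1000

# Create a small scratch file if needed, filled with incompressible data.
if not os.path.exists(PATH):
    with open(PATH, "wb") as f:
        f.write(os.urandom(BLOCK * 1024))  # 4 MB

fd = os.open(PATH, os.O_RDONLY)
blocks = os.fstat(fd).st_size // BLOCK

# Queue depth 1: issue one 4 kB read at a random offset, wait for it, repeat.
start = time.perf_counter()
for _ in range(IOS):
    os.pread(fd, BLOCK, random.randrange(blocks) * BLOCK)
elapsed = time.perf_counter() - start
os.close(fd)

print(f"{IOS / elapsed:.0f} IOPS, {elapsed / IOS * 1e6:.1f} us average latency")
```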

------
gopher
I've written a cross-platform, pure Python benchmark for IOPS; see
[http://benjamin-schweizer.de/measuring-disk-io-performance.html](http://benjamin-schweizer.de/measuring-disk-io-performance.html)

------
lutorm
One could also argue that if it's really that hard to tune your benchmarking
system to get optimal results, those results aren't really relevant to any
real-life task...

------
marapril
I really think real-world app testing is the best way to test these things:
important things like app launch times, boot times, and the performance of the
above after the SSD has been aged. Average numbers matter far more than peak
numbers, since peak rates may last only a fraction of the time that other
measured rates do.

------
kevinpet
It seems that for most purposes, a simple benchmark representing realistic
scenarios is preferable to benchmarks designed to maximize performance
numbers.

------
stevek
Ouch, it looks like defragging your SSDs became worthwhile again!

