
SSDs and distributed data systems - boredandroid
http://blog.empathybox.com/post/24415262152/ssds-and-distributed-data-systems
======
Dave_Rosenthal
I'll add a few things (I've done a lot of SSD testing lately) as SSD
performance is highly dependent on factors that don't influence hard disks:

0) Sequential performance is easy. If that's what you want, SSDs work pretty
well and you can skip this post. The below points are for random IO.

1) Almost all SSDs are significantly slower at mixed read/write workloads (as
a database would generate) than at either pure reads or pure writes. Sometimes
as little as 1/4 the speed!
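A quick way to see the workload shape on your own hardware: the toy Python sketch below issues a seeded 70/30 read/write mix of 4 KB random IOs against a small file. All names and sizes are illustrative -- a real SSD test needs O_DIRECT (so the page cache doesn't hide the device), a file covering most of the disk, and long runtimes, or just use a tool like fio.

```python
# Toy sketch of a mixed random read/write workload (not a real benchmark:
# no O_DIRECT, tiny file, short run -- it only shows the access pattern).
import os
import random
import tempfile

BLOCK = 4096
BLOCKS = 256          # tiny file; a real test would span most of the disk
READ_MIX = 0.7        # 70% reads / 30% writes, as a database might issue

def mixed_workload(path, ops=1000, seed=42):
    rng = random.Random(seed)
    reads = writes = 0
    fd = os.open(path, os.O_RDWR)
    try:
        for _ in range(ops):
            offset = rng.randrange(BLOCKS) * BLOCK   # random 4 KB-aligned IO
            if rng.random() < READ_MIX:
                os.pread(fd, BLOCK, offset)
                reads += 1
            else:
                os.pwrite(fd, b"x" * BLOCK, offset)
                writes += 1
    finally:
        os.close(fd)
    return reads, writes

with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"\0" * BLOCK * BLOCKS)
    path = f.name
reads, writes = mixed_workload(path)
os.unlink(path)
print(reads + writes)  # 1000
```

Timing the pure-read and pure-write variants separately against the mixed run is what exposes the slowdown described above.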

2) Random I/O throughput, especially for writes, is highly dependent on how
full the disk is (assuming TRIM). For example, a 50% full disk is usually
pretty fast, an 80% full disk is getting slower, and a 95% full disk is dog
slow.

3) I have seen SSD controller and firmware version drastically impact
performance. A recent firmware "upgrade" halved IOps on 100 of our SSDs (glad
we tested that one...)

4) Time dependence! Many of my heavy I/O tests took 8+ hours to stabilize
(often with very odd transition modes in between). Don't run a 30 second test
and assume that's how fast the disk will stay.

5) Lastly, have many outstanding IOs in the queue if you want good IOps
throughput. Think 32+.
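For what it's worth, a portable way to keep that many IOs outstanding without a native async-IO interface (io_uring, libaio) is a pool of threads issuing blocking preads; pread is thread-safe on a shared fd since it takes an explicit offset. A minimal sketch, all names illustrative:

```python
# Sketch: keeping ~32 random reads in flight via a thread pool.
# Native async IO is the proper tool; this is a portable stand-in
# that still builds queue depth at the device.
import os
import random
import tempfile
from concurrent.futures import ThreadPoolExecutor

BLOCK = 4096
BLOCKS = 256
QUEUE_DEPTH = 32  # the "32+" outstanding IOs suggested above

with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"\0" * BLOCK * BLOCKS)
    path = f.name

fd = os.open(path, os.O_RDONLY)
rng = random.Random(0)
offsets = [rng.randrange(BLOCKS) * BLOCK for _ in range(1000)]

# each worker thread issues an independent positional read on the same fd
with ThreadPoolExecutor(max_workers=QUEUE_DEPTH) as pool:
    sizes = list(pool.map(lambda off: len(os.pread(fd, BLOCK, off)), offsets))

os.close(fd)
os.unlink(path)
print(len(sizes))  # 1000
```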

My recommendation overall: Test your actual application workload for 24 hours.
Use a Vertex 4 with firmware 1.4 less than 75% full for your mixed read/write
workload needs!

------
sounds
I'd really like to see x86_64 servers that use direct memory-mapped IO to the
SSD. Some add-in PCI cards already do (can't find the link right now). It
would really take Intel jumping on board for it to become the standard though.

More specifically, right now, assuming the app wants something from a memory-
mapped file, the process is:

      1. app attempts to read some bytes
      2. CPU page fault triggers a read by the kernel to get them from disk
      3. kernel block layer locates bytes on disk
      4. block read request submitted to storage subsystem
      5. read request (probably merged with others, e.g. readahead) submitted to SATA controller
      6. ATA command decoded by SSD
      7. bytes sent back up the chain
      (skipping the return trip up the chain for brevity)
    
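Steps 1-2 of that chain can be demonstrated from user space with Python's mmap module: the map itself is cheap, and the first access to the mapped region is what triggers the page fault that the kernel services from disk (toy example, file contents made up):

```python
# Sketch of steps 1-2 above: map a file, then touch it.
# The first access to m[...] page-faults; the kernel fills the page
# from the file, and the access completes as a plain memory read.
import mmap
import os
import tempfile

with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"hello from disk")
    path = f.name

fd = os.open(path, os.O_RDONLY)
m = mmap.mmap(fd, 0, access=mmap.ACCESS_READ)  # file mapped read-only
first = m[0:5]      # this access triggers the page fault / disk read
m.close()
os.close(fd)
os.unlink(path)
print(first)        # b'hello'
```

Everything below the page fault (block layer, SATA, the SSD's ATA decoding) is what the direct-mapped PCI approach would cut out.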

Some PCI SSDs already make it possible for the kind of improvement that looks
like this:

      1. app opens a file
      2. blocks mapped into app address space as read-only

~~~
wmf
_I'd really like to see x86_64 servers that use direct memory-mapped IO to the
SSD._

It's not clear whether this is as good an idea as people think. Flash is fast,
but the difference between ~20 us and ~50 ns is still huge. You could end up
wasting a lot of cycles while the processor is stalled. Also, there's no way
for memory to report errors short of a machine check that (if you're lucky)
kills the process.

Fusion io is working on an intermediate approach that bypasses the OS but
doesn't try to treat flash as memory.

------
mmagin
I disagree with the idea that read/write cycles for (rotating) hard disks are
unlimited. The limit may be much higher, and a 'per-block' characterization is
largely improper -- what fails is not the block itself but the moving parts --
but when I worked at a job where we had tons of cheap consumer-grade hard
disks, they did seem to fail after a few hundred complete passes over the
disk.

~~~
spydum
Pretty sure google published a paper on disk failures that can begin to refute
that assumption: <http://research.google.com/archive/disk_failures.pdf>

------
crazygringo
I remember RethinkDB was tackling this... if I recall correctly, they had been
developing a new MySQL engine specifically designed for SSDs, but looking at
their site now it seems they gave that up and changed it to a memcache
protocol to focus on NoSQL?

Has anybody ever used it, or know of other companies working on similar ideas?

~~~
strlen
Tokutek uses Fractal Trees which have several SSD-friendly properties:
<http://www.tokutek.com/>. Unfortunately, I don't have any first hand
experience with it.

If you want an interesting challenge, you could also try implementing a MySQL
storage engine using LevelDB. Obviously a basic implementation shouldn't be
too difficult, but getting good performance and reliability would require some
effort.
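The core of such an engine would be the key encoding: map (table, primary key) to a byte-ordered key, so that primary-key range scans become contiguous scans of the sorted keystore. A sketch of that idea, with a plain dict standing in for LevelDB (which also stores keys in sorted order); all names here are illustrative:

```python
# Sketch: key layout a LevelDB-backed storage engine might use.
# A dict iterated via sorted() stands in for LevelDB's sorted keyspace.
def row_key(table: str, pk: int) -> bytes:
    # fixed-width big-endian pk so byte order == numeric order,
    # which keeps primary-key ranges contiguous in the keystore
    return table.encode() + b"/" + pk.to_bytes(8, "big")

store = {}  # stand-in for a LevelDB database
store[row_key("users", 2)] = b"bob"
store[row_key("users", 1)] = b"alice"
store[row_key("users", 10)] = b"carol"

# range scan over table "users" in primary-key order
scan = [store[k] for k in sorted(store) if k.startswith(b"users/")]
print(scan)  # [b'alice', b'bob', b'carol']
```

Note that naive variable-width encoding (e.g. `str(pk)`) would sort 10 before 2; getting details like this right is part of why the "basic implementation" is easy but a good one takes effort.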

