

A File System All Its Own – specialized for SSDs - cnahr
http://queue.acm.org/detail.cfm?id=2463636

======
Sami_Lehtinen
Layering stuff for legacy reasons isn't anything new. Connecting flat digital
displays through a display adapter's DAC, a VGA cable, and an ADC was never a
smart idea. Still, many did it, and some people are even doing it today. It
doesn't make any sense whatsoever.

~~~
vy8vWJlco
It makes perfect sense when, say, that's the only cable you have on hand and
when "mostly working for cheap" beats "not working, but technically correct
and optimal." And then there are those to whom "lossy but non-DRM analog"
(VGA) typically beats "lossless digital with DRM capability that will sneak up
on you when you least want it" (HDMI).

~~~
petsos
What the hell am I going to do with a VGA display when the only output I have
is a serial cable? And then why should I pay for a VGA cable? I can connect my
VT100 anywhere with a cable I can make myself.

~~~
vy8vWJlco
I'll bite. I didn't say VGA was cutting edge, or the one true cable, or that
real programmers use butterflies - only that it remains practical for a very
large number of uses. I believe the longevity of VGA has a lot to do with its
compatibility and explicit lack of DRM features. I have often had a choice and
gone with "good enough" VGA when digital was an option, simply because I am
aware that by buying HDMI I am not only financially supporting and licensing
DRM, but am committing to a technology that can be used against me. I may not
be the norm, but I am far from a Luddite.

~~~
shrughes
There are other digital options besides HDMI...

~~~
vy8vWJlco
Yes, DVI was popular there for a while, and there is an assortment of others,
but HDMI seems to have outpaced DVI in my encounters (and the others are quite
niche: some Mac, Intel, etc.). DVI is still attractive, but things are
shipping without it in favor of HDMI (in my experience), and that only further
extends VGA's lifespan. People already have VGA cables, and their only
"upgrade" path to digital is often HDMI, so VGA remains the lesser of evils
based on price and personal lifestyle/ethics.

~~~
LukeShu
My experience is similar, but I'll add that DVI->HDMI cables are cheap, and
mean that you can use DVI even when your displays are HDMI.

------
joe_bleau
The MLC and SLC NAND trends in figure 1 are confusing me. Historically, wasn't
SLC first? Yet the graph shows pricing for MLC back to 2001, and SLC back to
only 2007-ish. It correctly shows that MLC is less expensive than SLC.

Maybe he didn't have old price data for SLC?

~~~
mbjorling
It's hard for people to get data on flash chips/prices without signing an NDA.
He probably found the data in various places and stitched it together.

~~~
joe_bleau
Hmm. I read the chart as price for the entire drive, not the 'raw materials'
of the drive.

------
baruch
The problem is that most users (home & enterprise) just want things to work;
they don't really care much how to get there and to have the best efficiency.

It wouldn't be too hard to build a good filesystem that works over raw NAND
flash, but it would not work on older OSes and would not work in the
enterprise storage market, so there would be fewer buyers, it would cost more,
no one would buy it, and it would not be made.

Even the enterprise storage folks just want the damn flash devices to just
work, without the storage folks doing anything with them. It's taken to
extremes sometimes, and the flash vendors just do whatever they are told,
since there is a lot of market in whatever the software-defined engineers
want. Except the engineers mostly want to deal with high-level algorithms and
brag about how fast their algorithms are without really thinking about the
hardware. Hardware is hard. Besides, they can do something with the hardware
that is already on the market rather than envision something better.

TL;DR: unless someone holds the stick at both ends (software and hardware), no
one will make a reduced-layer solution.

~~~
lgeek
> The problem is that most users (home & enterprise) just want things to work,
> they don't really care much how to get there and to have the best
> efficiency.

When it comes to research, no one cares that much about what home users and
enterprise-users-small-enough-not-to-use-custom-software-stacks want _now_.
Case in point: I don't think many IT managers were that eager to switch to
using ZFS in production when it was announced back in 2004 (and ZFS had been
under development for years at that time).

I've considered doing a PhD project pushing and stretching the boundaries of
SSD firmware/operating system/filesystems because I think there's a lot of
improvement that can be done in this area. The cost of OpenSSD that the
sibling comments mention wasn't even that much of a problem. I seriously don't
think someone not associated with a research department somewhere would have
the time and/or know-how to do original research _and_ implement a working
prototype. Hell, I might get a devkit, but I doubt I'll do anything
interesting and original at the same time. Which brings us to the actual
problem:

Documentation and NDAs. For lots of ICs, microcontrollers, and processors you
can freely get hardware documentation, programming manuals, etc. For flash
controllers and high-density NANDs? Almost nothing at all. Maybe some stuff
can be reverse engineered, and you get some documentation for OpenSSD. But the
NAND manufacturers won't tell you the really, really important stuff about
things like failure patterns, which would allow you to optimize error
correction and wear leveling, for example.

~~~
baruch
Documentation is an issue indeed.

I don't think you really need the inner information about NAND to do original
and innovative research. It really depends on the area you want to work in;
the SSD firmware level might require it, but the SSD makers are already on
that route (some better than others). The other option is to not pay too much
attention to the differences between NAND chips and instead implement
something at a higher level, pushing the hardware-agnostic smarts to the OS.

The OpenSSD also lacks documentation; last time I looked at it, there was no
info on how to do NCQ on the SATA interface, and without that there is no
talking about a speedy SSD.

------
kalleboo
> Again, this approach today requires a vendor that can assert broad control
> over the whole system—from the file system to the interface, controller, and
> flash media.

Apple would be well-positioned here if they still cared about their Macs. HFS
is due for a replacement anyway after 30 years. (It could be done on iOS
devices too, but flash I/O performance doesn't seem to be the major bottleneck
for those uses.)

~~~
baruch
They actually have all the components under their arm: they bought Anobit,
which made fabulous SSDs, and they are snatching up engineers around here.

Anobit SSDs were the best I've seen so far in terms of consistent performance.

~~~
adamleventhal
Agreed that Apple is very well-positioned for this -- the question is whether
they care about the problem. A purpose-built filesystem would improve
performance and longevity at a lower cost. Is it worth it for Apple to invest
in a brand new file system and data path?

------
mav3r1ck
Not quite sure what this article is talking about:

[https://en.wikipedia.org/wiki/List_of_file_systems#File_syst...](https://en.wikipedia.org/wiki/List_of_file_systems#File_systems_optimized_for_flash_memory.2C_solid_state_media)

I personally believe log-based file systems are a perfect match: they never
save the same file repeatedly to the same location (which provides built-in
wear-leveling), and one can optimize writes by always clearing the head of the
log for the next write.
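The append-at-head behavior described above can be sketched in a few lines.
This is a toy model, not any real filesystem's code; all names are invented
for the illustration:

```python
# Minimal log-structured store: every write appends at the log head, so a
# rewrite of the same logical block lands on fresh media instead of the old
# location -- which is exactly what gives the built-in wear-leveling.

class LogStore:
    def __init__(self):
        self.log = []      # physical log: list of (logical_block, data)
        self.mapping = {}  # logical block -> index of its newest log entry

    def write(self, logical_block, data):
        self.mapping[logical_block] = len(self.log)  # newest copy wins
        self.log.append((logical_block, data))       # always append at head

    def read(self, logical_block):
        return self.log[self.mapping[logical_block]][1]

    def live_entries(self):
        # Entries superseded by later writes are garbage; cleaning them out
        # is what "clearing the head of the log" amounts to.
        return set(self.mapping.values())
```

Note the flip side: overwriting a block leaves its stale copy behind in the
log, so a garbage collector (and with it, write amplification) comes along
for the ride.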

~~~
adamleventhal
That list is specious at best.

Take ZFS. I designed the flash integration for ZFS; it's used as a caching
tier. ZFS is definitely not optimized for use with flash as its primary
backing store. The same is true for some of the other filesystems in the list;
offhand: CASL and WAFL.

Most of the rest are designed for embedded use cases, are research toys, or
are embedded research toys.

------
stcredzero
The memory hierarchy needs to be revised to take into account the different
performance characteristics of Flash RAM vs. hard drives. There is no
disputing that NAND Flash SSD are very different from Dynamic RAM, static RAM,
and HD.

------
jpalomaki
Wouldn't it make sense to use an object storage style interface to SSDs?
Instead of managing sectors and cylinders the SSD would provide interface for
managing objects, pretty much like cloud storage services like S3.

~~~
mbjorling
It's one way to look at it. However, an object interface toward the SSD does
not solve the problem of variability that the author mentions.

The variability is caused by the "incompatible" NAND flash interface (read,
write, and erase), while the IO interface to the host system is read/write
(with an occasional trim to let the device know of unused pages). Therefore,
an interface other than simple read/write is the holy grail. This interface
might be one that gives various guarantees to the user, e.g. atomic
operations. It doesn't need to be only an object/page store.
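The mismatch above can be made concrete with a toy translation layer (the
block geometry and all names are invented for this sketch): the host
overwrites logical pages freely, while the flash below only supports
programming erased pages and erasing whole blocks.

```python
# Toy flash translation layer illustrating the read/write (host) versus
# read/program/erase (NAND) mismatch.

PAGES_PER_BLOCK = 4

class Ftl:
    def __init__(self, num_blocks):
        # None means "erased"; a page may be programmed only once per erase.
        self.flash = [[None] * PAGES_PER_BLOCK for _ in range(num_blocks)]
        self.map = {}         # logical page -> (block, page)
        self.cursor = (0, 0)  # next erased page to program

    def host_write(self, lpage, data):
        # The host thinks it overwrote lpage in place; we actually
        # programmed a fresh page and remapped, leaving a stale copy behind.
        blk, pg = self.cursor
        self.flash[blk][pg] = data
        self.map[lpage] = (blk, pg)
        pg += 1
        if pg == PAGES_PER_BLOCK:  # block full, move on to the next one
            blk, pg = blk + 1, 0
        self.cursor = (blk, pg)

    def host_read(self, lpage):
        blk, pg = self.map[lpage]
        return self.flash[blk][pg]

    def erase(self, blk):
        # Erase works only on whole blocks, never on single pages.
        self.flash[blk] = [None] * PAGES_PER_BLOCK
```

The stale copies this remapping leaves behind are precisely what garbage
collection must reclaim, and GC timing is a major source of the variability
discussed here.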

------
vy8vWJlco
_"Layering the file system translation on top of the flash translation is
inefficient and impedes performance."

"For many years SSDs were almost exclusively built to seamlessly replace hard
drives; they not only supported the same block-device interface"_

The point of storage is to be able to put anything you want on it. That
contract _is_ the block interface, and includes the ability to change the
filesystem. A file with internal structures is also a filesystem. The
interfaces are fine. Change for change's sake should be avoided. (Providing a
bypass, SSD-optimized interface is fine, but, ahem: "put down the crack
pipes"... <https://news.ycombinator.com/item?id=5541063> )

~~~
baruch
I actually agree that using the block interface makes sense for storage, but I
would have loved minimal interference from the SSD. If it just exposed the
entire flash for the user to address, reported when a block has problems, and
maybe gave some stats about the underlying flash chips, I'd be very happy.
That would enable building better things on the OS/application side.

There is quite a bit of a chicken-and-egg problem here, though: all current
filesystems basically assume that the underlying media never has any faults,
and that if it does, the problems are static and do not develop over time.
This is obviously incorrect for flash, and it wasn't true even for rotating
media. Since every OS requires fault-free media, the SSD vendors work hard to
provide a semblance of such fault-free media, which makes it harder to offer
the best possible performance or a different trade-off than the one they have
taken.

~~~
vy8vWJlco
I also want lower-level access, but I would presumably not be using it for
files/reliable storage.

If a lossy interface is acceptable, why couldn't SSDs simply expose a faster
albeit lossy block device and, if necessary, an extended SMART or custom
inspection method, and let the user take responsibility for wear-leveling,
ECC, etc? It would be backwards compatible with other things by virtue of
presenting the standard block interface. A common ECC+wear-leveling middle
layer could evolve allowing use of standard filesystems and a common codebase
for all flash storage, relieving the apparent burden on SSD vendors who would
love to sell fast unreliable storage rather than reliable storage.
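For a sense of what such a software middle layer would actually run, here is a
toy single-error-correcting Hamming(7,4) code. Real flash ECC uses far
stronger BCH/LDPC codes, so this only illustrates the division of labor being
proposed, not a usable design:

```python
# Hamming(7,4): 4 data bits protected by 3 parity bits; any single bit flip
# in the 7-bit codeword can be located and corrected.

def hamming74_encode(d):
    # d is a list of 4 bits; parity bits sit at codeword positions 1, 2, 4.
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4
    p2 = d1 ^ d3 ^ d4
    p3 = d2 ^ d3 ^ d4
    return [p1, p2, d1, p3, d2, d3, d4]

def hamming74_decode(c):
    # Recompute parity; the syndrome is the 1-based position of the flip.
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s3
    if syndrome:                # nonzero syndrome: correct the flipped bit
        c = c[:]
        c[syndrome - 1] ^= 1
    return [c[2], c[4], c[5], c[6]]
```

Running this per page on the CPU is exactly the kind of work the next
paragraph argues belongs on a dedicated chip instead.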

I think the chicken-and-egg problem is just an egg problem, though: even
though I'd love lower-level access to the unreliable bits, I have to expect
that the market for unreliable storage is very small, not unlike the
high-efficiency-but-sometimes-exploding toilet. :)

If reliable storage is the primary use case, maybe the drives just need to be
smarter to keep up. I'd rather an ASIC handle ECC, etc., transparently (for
the same reasons I'd rather have a dedicated GPU) than run ECC (or 3D
floating-point software) on my general-purpose processor. If you inevitably
want reliable storage and just wind up running ECC, etc., on the CPU, the
speed gains disappear and we're back to something like a pre-DMA world, with
the main processor doing work that could be done in parallel by a dedicated
chip. If the drive is the right place for the offload, I'd rather the
economic pressures remain for the SSD vendors to optimize inside that black
box, behind the standard reliable interface.

That said, again, I too would love finer-grain control.

~~~
baruch
I can see a mix where some parts are done in hardware (ECC comes in there)
and others in software (FTL, error recovery, RAID).

The block interface itself actually matches the flash: you read/write/erase in
blocks; they may not be 512 bytes but rather 4k/8k/256k, whatever works for
the underlying hardware.

------
hobbes78
Is exFAT no good?

~~~
mbell
In short: no, it's not a very good file system.

Even ignoring the licensing/patent issues with it, it's non-journaled and has
only a single FAT in most implementations; it's easily corrupted and difficult
to repair. It also lacks a number of useful features like preallocation,
robust metadata, etc.

------
frozenport
As a side note, Lustre seems to work fine on SSDs.

~~~
adamleventhal
Everything works "fine" on SSDs! They were designed to drop right into place.

