
File systems unfit as distributed storage back ends: lessons from Ceph evolution - r4um
https://blog.acolyer.org/2019/11/06/ceph-evolution/
======
j-pb
From the Blog post:

> We looked at this issue earlier. Fundamentally the tension here is that
> copy-on-write semantics don’t fit with the emerging zone interface
> semantics.

While the paper writes:

> It is not surprising that attempts to modify production file systems, such
> as XFS and ext4, to work with the zone interface have so far been
> unsuccessful [19, 68], primarily because these are overwrite file systems,
> whereas the zone interface requires a copy-on-write approach to data
> management

This seems to be a contradiction, and I'd side with the original paper.

~~~
ZoomZoomZoom
Yes, I found it strange too. The third quote block directly states that a
copy-on-write design is encouraged by the current hardware.

~~~
psds2
What it says is that the zone interface is specifically suited to log-
structured CoW. Later they talk about how they tried using a write-ahead log,
but that it did not work well for managing metadata in a distributed
filesystem.
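
To make the distinction concrete, here's a toy userspace sketch (purely
illustrative; everything in it is invented for the example, and real zoned
devices are driven through the kernel's zoned block ioctls) of why the zone
interface pushes you toward copy-on-write: a zone accepts writes only at its
write pointer, so "updating" data means appending a fresh copy and leaving the
old one behind as garbage.

    /* Toy model of a zoned device: zones are append-only and can only be
       reused after an explicit reset, so in-place overwrite is impossible
       and updates must go out-of-place, copy-on-write style. */
    #include <stdio.h>

    #define ZONE_SIZE 8
    #define NZONES    4

    struct zone {
        char data[ZONE_SIZE];
        int  wp;                    /* write pointer: next writable offset */
    };

    static struct zone zones[NZONES];

    /* The only write a zone allows: sequential append at the write pointer. */
    static int zone_append(int z, char byte)
    {
        if (zones[z].wp >= ZONE_SIZE)
            return -1;              /* zone full */
        zones[z].data[zones[z].wp] = byte;
        return zones[z].wp++;
    }

    /* An "overwrite" under zone rules: we cannot rewrite an old offset, so
       we append a new copy, moving to the next zone when the current one
       fills. The old copy becomes garbage for a cleaner to reclaim. */
    static int zone_update(int *z, char byte)
    {
        int off = zone_append(*z, byte);
        if (off < 0) {
            *z = (*z + 1) % NZONES;
            off = zone_append(*z, byte);
        }
        return off;
    }

    int main(void)
    {
        int z = 0;
        for (int i = 0; i < 12; i++)
            zone_update(&z, 'a' + i);
        printf("writes landed up to zone %d, wp %d\n", z, zones[z].wp);
        return 0;
    }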

------
angrygoat
I ran a ~0.5 PB Ceph cluster for a few years, on quite old spinning-disk
hardware (bought second-hand). It was great: it just worked, coped very well
with hardware failures, and told the operator what was happening. An extremely
solid, well-engineered system. My thanks to the Ceph team :)

~~~
yankcrime
Yup, adding to this: for about four years we ran a similar-size Ceph cluster
that supported a burgeoning public cloud platform, with users making extensive
use of block and object storage. We did this on a range of hardware, from
creaky second-hand SuperMicro boxes to newer all-flash Quanta-based machines.

Through questionable hardware selection as well as standard operational
challenges such as upgrades and scaling, Ceph never let us down.

The only time we had a major problem turned out to be our fault. The creaky
machines we were using at the start 'lost' half of their memory during a round
of power-failure testing. We didn't have monitoring in place to spot this, and
unfortunately it manifested after we lost a node, which triggered a significant
rebalancing operation across the cluster. With several machines missing 50% of
their RAM, this quickly descended into a horrendous disk-thrashing exercise.
Again, all credit due to Ceph - we were able to coax the cluster back into
life with no data loss.

------
gambler
I was just discussing with a colleague how technology accretes and how no one
reevaluates high-level design decisions even after every single factor leading
to those decisions has changed.

It's weird that basic filesystems today are so out of touch with modern
realities that we are _universally_ forced to resort to using complex
databases even in cases when the logical model of files and directories fits
the storage needs really well.

It's weird that hierarchical storage is the only universal model available on
all OSes and in all languages.

The more I think about it, the more I realize that we live in a bizarro world
where software runs everything, yet makes little to no sense from either a
human, modern hardware, or system design perspective.

~~~
nostrademons
Programming languages are in the same boat - the modern CPU works dramatically
differently from a PDP-11, but most programming models still assume an
accumulator machine; a flat memory hierarchy; effectively unbounded LIFO
program-control stacks; uniform machine word sizes; sequential in-order
execution; and byte streams for I/O. This is despite large register files,
multiple levels of caching, coroutines/promises, SIMD, multicore/GPU, and
page-level I/O all being things. In many cases even assembly language encodes
assumptions from the 1970s, and is then internally translated by the processor
into how the hardware actually works.
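
One tiny illustration of the flat-memory-hierarchy point (a sketch; actual
timings vary by machine): under the flat model these two loops do identical
work, but the row-major one walks consecutive addresses while the column-major
one strides across cache lines, so the first is typically several times faster
on real hardware.

    /* Summing a matrix two ways. A flat-memory model says these are the
       same amount of work; caches say otherwise. */
    #include <stdio.h>
    #include <stdlib.h>

    #define N 4096

    static double sum_row_major(double (*m)[N])
    {
        double s = 0;
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                s += m[i][j];   /* consecutive addresses, cache-friendly */
        return s;
    }

    static double sum_col_major(double (*m)[N])
    {
        double s = 0;
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                s += m[i][j];   /* N*8-byte stride, misses constantly */
        return s;
    }

    int main(void)
    {
        double (*m)[N] = calloc(N, sizeof *m);
        if (!m)
            return 1;
        printf("%f %f\n", sum_row_major(m), sum_col_major(m));
        free(m);
        return 0;
    }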

~~~
h0l0cube
I wonder what your thoughts are on the Mill Architecture? It's throwing out
every assumption on how to build a CPU, which mandates that the compiler needs
to be rewritten to generate code for it.

~~~
nostrademons
I hadn't heard of it before. I just looked it up and it looks interesting, but
I don't have enough hardware engineering experience to judge its feasibility.

I think a larger problem with new architectures is that consumer adoption
follows price/performance/power, not any inherent architectural quality. The
architecture we're stuck with is the one that most hardware devices get sold
with (x86/ARM right now). That gets determined by hardware OEMs, which in turn
make their decisions based on what'll help them sell the fastest devices with
the lowest power consumption for the least money. So something like RISC V is
fascinating and quite elegant, but until there are RISC V chips that are
cheaper and faster than Intel ones, it remains an academic curiosity. Then if
you're a compiler writer, you gotta work with what you've got for an installed
base, and you can't really get adoption for a new language unless it lets
startup founders unlock new markets because your combination of development
velocity + execution speed lets them do things they wouldn't otherwise be able
to.

~~~
h0l0cube
Ah yeah, I'm just as cynical as you are, in all the same ways, about the
prospects of a 'new architecture' becoming relevant. Just thought you might
have known the Mill and had some personal insights on the project.

------
wmitty
The link to the paper in the article requires an ACM subscription. Here is a
link to the version hosted by the authors:

[https://www.pdl.cmu.edu/PDL-FTP/Storage/ceph-exp-sosp19.pdf](https://www.pdl.cmu.edu/PDL-FTP/Storage/ceph-exp-sosp19.pdf)

~~~
dooglius
I loaded it just fine from the link

------
yankcrime
For anyone looking for more information and benchmarks on the performance
improvements in recent versions of Ceph (and with BlueStore in particular),
here's a write-up that was done as part of testing for infrastructure to
support the Human Brain Project: [https://www.stackhpc.com/ceph-on-the-brain-a-year-with-the-human-brain-project.html](https://www.stackhpc.com/ceph-on-the-brain-a-year-with-the-human-brain-project.html)

~~~
marmaduke
Unfortunately Julia was just a prototype system that they are going to turn
off. KNL made it into one of their main production systems (Jureca) as a
"booster" module, but they use GPFS on the main storage system (IIRC).

------
zzzcpan
I don't think the conventional wisdom of building on top of filesystems
exists. In distributed systems you always naturally gravitate towards using
raw storage devices instead of filesystems; it becomes obvious very early on
that filesystems suck too much and only create problems. And it's the same
with all the embedded database libraries: you really want to write your own,
because none of the existing ones were made to address the performance and
operational problems that arise even in small distributed systems. But at the
same time, early on you don't yet know most of the problems and don't want to
invest time implementing something you don't yet understand well enough, so
you end up building on top of filesystems and embedded databases, making
plenty of poor choices, and learning from your mistakes.

~~~
notacoward
As I pointed out to the authors, this wheel has turned a couple of times. In
the late 90s, many distributed filesystems (and most cluster filesystems) used
raw disks and their own format. This was a burden, both for the developers,
who had to maintain an entire low-level I/O stack in addition to the
distributed parts, and for the users, who had to learn new tools to deal with
these "alien" disks in their system (which also limited deployment
flexibility). Thus, when
the current crop - e.g. Ceph, Gluster, PVFS2 - came around, they went toward a
more local-FS-based approach. All of the issues mentioned in the paper were
still real, but on the hardware of the time (both disks and networks) those
weren't the bottlenecks anyway so the convenience was worth it. Now the
tradeoffs have shifted again, and so have the solutions.

Context: I've been an originator/maintainer for multiple projects in this
space, and currently work on a storage system where we add space in bigger
increments than Ceph's entire worldwide installed base (according to numbers
in the paper).

~~~
PaulHoule
The filesystem as an abstraction has been getting long in the tooth for a long
time -- it's taken the industry a long time to recognize this, but that's the
way it is with filesystems, because most people prefer a boring filesystem
that never munches their data to an interesting one.

Some evidence:

* The proliferation of Linux filesystems such as ext4, XFS, ZFS, btrfs, reiserfs, JFS, JFFS, bcachefs, etc. If any of those filesystems were truly adequate there wouldn't have to be so many.

* Microsoft's failure to replace NTFS with ReFS. (It interacts with Storage Spaces in such a way that it will never be reliable.)

* Microsoft's giving up on the old WSL and replacing it with a virtualized Linux kernel, because metadata operations on NTFS are terribly slow compared to Linux filesystems, but people don't usually notice until they try to use an NTFS volume as if it were an ext4 volume.

* The popularity of object stores, systems like Ceph, S3, etc.

* Numerous filesystems promising transaction support and then backing away from it (btrfs as mentioned in the article, NTFS, exFAT, etc.)

* Proliferation of APIs to access filesystems more efficiently. On one hand there are async I/O access methods for filesystems, which aren't quite as solid as async I/O for networks; on the other hand there is mmap, which can cause your thread to block not only when you make an I/O call but also later, when you access memory.

* Recently, some filesystem APIs that address the real issues (e.g. pwrite, pread) have appeared, but astonishingly late in the game. (A small sketch of positional I/O follows below.)
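
As a sketch of that last point (standard POSIX calls; the file name is made
up): pread/pwrite take an explicit offset, so concurrent threads can share one
descriptor without racing on the shared file cursor the way lseek() + write()
does.

    /* pread/pwrite name their offset explicitly, so no shared file
       position is read or updated -- the classic lseek()+write() race
       between threads simply doesn't exist. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        char buf[16] = {0};
        int fd = open("demo.dat", O_RDWR | O_CREAT | O_TRUNC, 0644);
        if (fd < 0)
            return 1;

        pwrite(fd, "hello", 5, 0);      /* write at offset 0 */
        pwrite(fd, "world", 5, 4096);   /* write at offset 4096 */
        pread(fd, buf, 5, 4096);        /* read it back, cursor untouched */

        printf("%s\n", buf);            /* -> world */
        close(fd);
        return 0;
    }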

~~~
binarycrusader
_The proliferation of Linux filesystems such as ext4, XFS, ZFS, btrfs,
reiserfs, JFS, JFFS, bcachefs, etc. If any of those filesystems were truly
adequate there wouldn't have to be so many._

First of all, ZFS is not a “Linux” filesystem. Second, the available choice of
filesystems is a strange thing to use as justification that previously
existing filesystems are “inadequate”. By what criteria are you establishing
adequacy? The success of a technology often depends not solely on its
technical excellence but on a variety of factors that may have nothing to do
with the technology itself. (Betamax vs VHS, etc.)

~~~
PaulHoule
In theory people would pick the filesystem which is ideal for their
application.

In practice it is a lot of work to research the choices, and there's a high
risk that you'll discover something wrong with your filesystem only when it is
too late.

It's one thing to pontificate for and against particular file systems; it's
another to use one for years, across terabytes, etc. ZFS might scrub your data
to protect against bitrot, but I remember reading harrowing tales from ZFS
enthusiasts who were recovering (or not recovering) from wrecks every week but
seemed to think it was a lot of fun, or that it conferred status on them, or
was otherwise a good thing.

I stuck with ext4 for a long time before finally building a server that uses
ZFS/HDD for media storage (i.e. not a lot of random access).

I remember the time when a project I was involved with chose reiserfs because
they thought it was "better", and then they were shocked when, once in a
while, the system crashed and we found that a file that had just been created
was full of junk.

That's a trade-off they made: they decided it was important to journal
filesystem metadata (so the length of the file is right) but not to protect
the contents. If they had read the docs, really thought about it, and
understood it, they would have known, but they didn't.

This book points out that in cases where there is too much competition, you
can switch all you like between flawed alternatives, but have no way to
communicate what you really want:

[https://en.wikipedia.org/wiki/Exit,_Voice,_and_Loyalty](https://en.wikipedia.org/wiki/Exit,_Voice,_and_Loyalty)

And when it comes to the "filesystem API": probably anybody who has special
needs for filesystem performance would find a different API than the standard
one to be a boon.

~~~
binarycrusader
I don't disagree with the general premise that a single filesystem might not
be applicable or appropriate to all cases, or that existing filesystem APIs
are generally deficient.

My primary issue was with the two specific assertions I addressed: one about a
given filesystem's origins, and one about choice being used as a proxy for an
evaluation of adequacy.

As for ZFS enthusiasts "recovering from wrecks every week", I suspect you're
specifically referring to ZFS on Linux or one of the BSDs -- which is not the
same as ZFS when used in its original environment -- Solaris.

~~~
PaulHoule
No, it was on Solaris back when ZFS was new.

It seemed like these people enjoyed having wrecks, like the Plebe who enjoyed
getting hazed in

[https://www.amazon.com/Sense-Honor-Bluejacket-Books/dp/1557509174](https://www.amazon.com/Sense-Honor-Bluejacket-Books/dp/1557509174)

~~~
seized
So because a file system had issues 20 years ago when it was new... Do you
still drive a car with a carburetor and drum brakes?

ZFS is now incredibly stable and durable, with the exception of some of the
early non-production ZFS on Linux work that is now fixed (and was specifically
billed as not for production use). It has seen me through issues that other
file systems would have failed on, including drive failures, hard shutdowns, a
bad RAM module, a SAS card being fried by a CPU water cooler, etc. Years and
terabytes just on my systems, zero issues.

In fact, one of the tests that Sun did back in the day was to write huge
amounts of data to a NAS and pull the power cord mid-write, then repeat that a
few thousand times. It never corrupted the file system.

------
gwern
The end-to-end principle strikes again: lowest common denominator abstractions
like filesystems are often incorrect, inefficient, or both for complex
applications and ultimately must be bypassed by custom abstractions tailored
for the application.

~~~
anfilt
It's kinda why I think something like an exokernel would make more sense than
way OSes trying to abstract things to a certain level. We should be trying to
build abstractions that can be peeled layer by layer like an onion, not a
potato.

~~~
kragen
Unix files aren't quite an exokernel-style "just securely multiplex the disk"
but in some ways they come closer than some other alternatives. No file types,
just randomly accessible bytes (well, and an execute bit); no multiple
streams; no ISAM, just bytes (except that you do have directories); no
insertion (but you do have append and truncate).

You could make them _more_ disk-like by making them fixed-size, with the size
specified at creation time, and accessing them in blocks rather than in bytes.
Would those be an improvement? I tend to think not. Certainly if those were
the semantics provided by the kernel you would want userland filesystem
processes to provide appendable files.

Copy-on-write file versioning and cross-file, cross-process transactions, on
the other hand, could be real pluses. I'd be okay with those being provided by
userland processes rather than the kernel, but I'd sure like to have them.

------
throw0101a
We're actually facing an issue with our Ceph infrastructure in the 'upgrade'
from FileStore to BlueStore: the loss of use of our SSDs.

We created our infrastructure with a bunch of hardware that had HDDs for bulk
storage and an SSD for async I/O and intent log stuff.

The problem is that BlueStore does not seem to have any use for off-to-the-
side SSDs, AFA(we)CT. So we're left with a bunch of hardware that may not be
as performant under the new BlueStore world order.

The Ceph mailing list consensus seems to be "don't buy SSDs, but rather buy
more spindles for more independent OSDs". That's fine for future purchases,
but we have a whole bunch of gear designed for the Old Way of doing things. We
could leave things be and continue using FileStore, but it seems the Path
Forward is BlueStore.

Some of us do not need the speed of an all-SSD setup, but perhaps want
something a little faster than only-HDDs. We're playing with benchmarks now to
see how much worse the latency is with BlueStore+no-SSD, and whether the
latency is good enough for us as-is.

Any new storage design that cannot handle a "hybrid" configuration of
combining HDDs and SSDs is silly IMHO.

I joked that we could tie the HDDs together using ZFS zvol, with the SSD as
the ZIL, and point the OSD(s) there.
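
For what it's worth, the joke is implementable; something like the following,
with pool layout and device names invented for illustration:

    # HDDs pooled together, SSD as the separate intent log (SLOG),
    # then a zvol for the OSD to sit on. Names and sizes are made up.
    zpool create tank raidz2 sda sdb sdc sdd log nvme0n1
    zfs create -V 10T tank/osd0    # appears as /dev/zvol/tank/osd0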

~~~
Nullabillity
From what I can tell, BlueStore still supports using separate disks for the
WAL/RocksDB[0]?

[0]:
[https://docs.ceph.com/docs/master/rados/configuration/bluestore-config-ref/#devices](https://docs.ceph.com/docs/master/rados/configuration/bluestore-config-ref/#devices)
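
If that's right, then (assuming a release with ceph-volume; the flags are from
the BlueStore docs linked above, and the device paths are placeholders)
putting the RocksDB and WAL on the SSD would look something like:

    # HDD as the data device, SSD partitions for RocksDB and the WAL.
    # Device paths are placeholders.
    ceph-volume lvm create --bluestore \
        --data /dev/sdb \
        --block.db /dev/nvme0n1p1 \
        --block.wal /dev/nvme0n1p2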

~~~
throw0101a
Yes; the confounding factor was/is that we are using ceph-ansible with
ceph-disk. If we want to upgrade, we have to make a whole bunch of
inter-related changes.

(I'm not the lead on the project, so no doubt I have forgotten some of the
exact complications involved. Though it does seem slightly strange (IMHO) that
you're getting rid of the file system but still keeping the LVM layer.)

------
rsync
I have sympathy with, and am open-minded to, the conclusions of this article -
even as a die-hard, true believer in the filesystem (esp. ZFS) as a useful
foundational building block.

However, I hope that these conclusions do not lead to the intentional
deprecation of support for filesystems in projects like Ceph. If a non-
filesystem backing store is superior, then by all means do it, but I hope the
ability to deploy a filesystem-backed endpoint will be retained.

In a pinch, it's very flexible and there are a lot of them lying around ...

~~~
amluto
If nothing else, most filesystems can expose a big file that can be used, with
decent efficiency, as though it were a disk. For COW filesystems, turning off
COW on the file may improve performance; and if the FS is on RAID, the
resulting performance and correctness properties will be odd.
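
On Linux/Btrfs that's the `chattr +C` attribute; here's a minimal programmatic
sketch using the standard inode-flags ioctls (the filename is made up, and the
flag only takes effect on a new, empty file):

    /* Set the NOCOW attribute on a new, empty file -- the programmatic
       equivalent of `chattr +C` on Btrfs. Must be done before the file
       has any data. */
    #include <fcntl.h>
    #include <linux/fs.h>
    #include <stdio.h>
    #include <sys/ioctl.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("vdisk.img", O_RDWR | O_CREAT | O_EXCL, 0644);
        if (fd < 0) { perror("open"); return 1; }

        int flags = 0;
        if (ioctl(fd, FS_IOC_GETFLAGS, &flags) == 0) {
            flags |= FS_NOCOW_FL;            /* disable copy-on-write */
            if (ioctl(fd, FS_IOC_SETFLAGS, &flags) != 0)
                perror("FS_IOC_SETFLAGS");   /* e.g. FS doesn't support it */
        }
        close(fd);
        return 0;
    }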

~~~
pm7
But if services need to use one big file, we lose the ability to overcommit
disk space (unless they implement something like TRIM).

------
ph2082
> “For its next-generation backend, the Ceph community is exploring techniques
> that reduce the CPU consumption, such as minimizing data serialization-
> deserialization, and using the SeaStar framework with a shared-nothing
> model…“

Seastar's HTTPD throughput, as mentioned on their site: with between 5 and 10
CPUs, it can achieve 2,000,000 HTTP requests/sec. Just wow. But if you look at
the HTTP performance data at the URL below, running a similar configuration on
clouds (AWS etc.) looks costly.

[http://seastar.io/http-performance/](http://seastar.io/http-performance/)

I wonder what the cost of achieving similar performance on a Hadoop stack
would be.

~~~
aasasd
> _it can achieve 2,000,000 HTTP requests/sec_

Is that including some DB operations? Because otherwise, eeeeh:
[https://www.techempower.com/benchmarks/#section=data-r18&hw=ph&test=plaintext](https://www.techempower.com/benchmarks/#section=data-r18&hw=ph&test=plaintext)

~~~
pas
Ceph has a lot of small ops, where lock and cache contention become very
significant. (Basically, a small piece of data/request comes in from the
network, and the OSD [object storage daemon] network thread has to pass it to
the I/O worker thread and then forget it. The I/O thread similarly just needs
to get the request, issue a read/write, and let the kernel work.)

Since the whole Ceph I/O model is async, the less waiting, scheduling,
contention, etc. happens, the better.

Currently Ceph is CPU-bound; that's why they are trying to improve CPU
performance.

------
tannhaeuser
This isn't surprising, but I guess the results need to be put into perspective
against the use case for distributed file systems and NFS, e.g. reasonable
scaling for static asset serving with excellent modularity, in particular when
paired with node-local caches. Of course Ceph etc. won't scale to Google
Search and Facebook levels, but it's still damn practical if you're scaling
out from a single HTTP server to a load-balanced cluster of those without
having to bring in whole new I/O infrastructures. And they help avoid cloud
vendor lock-in as well; for example, you can use CephFS on DO, OVH, and other
providers.

------
ngrilly
I read DigitalOcean is using Ceph for its object and block storage. Do they
use BlueStore in production too?

~~~
lathiat
Fairly sure that at a recent Ceph Days they did a talk mentioning they’re on
Jewel, which would still be FileStore.

BlueStore has only really been reaching production deployments and maturity
quite recently, in the last year or so.

~~~
ngrilly
Thanks! This probably means we can expect some IOPS and latency improvements
when they upgrade :)

~~~
kklimonda
DO upgraded their Ceph cluster to Luminous back in 2018, according to this
blog post: [https://blog.digitalocean.com/block-storage-volume-performance-burst/](https://blog.digitalocean.com/block-storage-volume-performance-burst/)

~~~
ngrilly
Thanks for the link. I missed that. It looks like they use BlueStore :)

~~~
lathiat
My mistake, I think I’m confusing them with another provider. Maybe OVH?

------
cfors
Slightly off topic, but I love Adrian Colyer’s blog. Since I never pursued
graduate studies in CS I never really got into reading research papers, but
I'd love to start reading some on my commute to work.

Does anyone have any recommendations for finding interesting papers? Do I need
to buy subscriptions? Is there a list of "recommended" papers to read, like we
have with programming literature, e.g. _The Pragmatic Programmer_?

~~~
heinrichhartman
An ACM subscription helps, since their "Digital Library" has pretty good
coverage of the CS literature. This is something that your employer might want
to sponsor.

A strategy for finding interesting papers that are worth reading as a novice
in the field is to pick any current paper that interests you and look for
heavily cited references. The following websites have citation data for
papers:

[1] semanticscholar.org

[2] scholar.google.com

[3] citeseerx.ist.psu.edu

------
newnewpdro
Don't most of their problems go away when you fallocate a pile of space and
use AIO+O_DIRECT like a database, to get the buffer cache and most of the
filesystem out of the way?

CoW filesystems like Btrfs provide ioctls to disable CoW as well, which would
be useful here when you've grown your own.

XFS has supported shutting off metadata updates like ctime/mtime for ages.

If you jump through some hoops, with a fully allocated file, you can get a
file on a filesystem to behave very much like a bare block store.
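
A minimal sketch of that recipe (plain Linux/POSIX calls; the file name and
sizes are invented, and the real alignment requirement depends on the device
and filesystem):

    /* Preallocate a big file, then write with O_DIRECT, database style:
       buffer, offset, and length all aligned. O_DIRECT bypasses the page
       cache; misaligned I/O fails with EINVAL. */
    #define _GNU_SOURCE             /* for O_DIRECT */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    #define ALIGNMENT 4096

    int main(void)
    {
        int fd = open("store.dat", O_RDWR | O_CREAT | O_DIRECT, 0644);
        if (fd < 0) { perror("open"); return 1; }

        /* Reserve 1 GiB up front: the filesystem's allocator is then
           out of the picture on the write path. */
        if (posix_fallocate(fd, 0, 1L << 30) != 0) {
            fprintf(stderr, "fallocate failed\n");
            return 1;
        }

        void *buf;
        if (posix_memalign(&buf, ALIGNMENT, ALIGNMENT))
            return 1;
        memset(buf, 0xAB, ALIGNMENT);

        if (pwrite(fd, buf, ALIGNMENT, 0) != ALIGNMENT)
            perror("pwrite");

        free(buf);
        close(fd);
        return 0;
    }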

~~~
notacoward
> Don't most of their problems go away

No. Yes, you can fallocate. Yes, you can use AIO, or even better io_uring.
Yes, you can use O_DIRECT ... if you want to give up caching and have to deal
with alignment restrictions, and it turns out that O_DIRECT turns into O_SYNC
in some unexpected edge cases. That gets you something that kind of sort of
behaves like a block device, but that doesn't solve any of the problems the
authors identify.

* No transactional semantics. Roll your own. (I'm actually OK with this one BTW, but others feel differently.)

* No help with slow metadata operations. Still slow if you're still using multiple files, or roll your own within a single file.

* No improvement in support for new media types (e.g. shingled/zoned).

In an ideal world, local filesystems would do a decent job supporting
distributed filesystems (and other data stores). Instead, we're in a world
where local filesystems fall short in many ways, and the solution to every
deficiency is to avoid the local filesystem as much as possible. That way
leads to silos and lock-in, so I don't think it's a good answer. Local
filesystems need to be better, or someone needs to create an equally standard
abstraction and set of tools to do what local filesystems can't.

------
bullen
I use ext4 with my distributed async-to-async JSON database:
[https://github.com/tinspin/rupy/wiki/Storage](https://github.com/tinspin/rupy/wiki/Storage)

You can try it here: [http://root.rupy.se](http://root.rupy.se)

The actual syncing is done over HTTP with Java, though, so maybe that's why it
works well for me.

------
Mathnerd314
I guess a minimal install of Ceph is a client and a node with 2-3 hard disks /
SSDs.

------
lightedman
Unfit? That's why I'm using KirbyCMS atop NTFS with zero issues, right?

~~~
lightedman
[https://news.ycombinator.com/item?id=13679424](https://news.ycombinator.com/item?id=13679424)

An older thread where people discussed this; it backs up my words right at the
top. I don't have a single issue with KirbyCMS + NTFS, I have distributed
back-end stuff as I desire, and it just works and has mature documentation.

