
File systems unfit as distributed storage back ends: 10 years of Ceph - mpweiher
http://muratbuffalo.blogspot.com/2019/11/sosp19-file-systems-unfit-as.html
======
toyg
Something that is not mentioned, though, is that all the years spent patching
existing file systems were an _education_ in how a FS works. They likely
wouldn’t have had enough knowledge to go balls-out on a new FS at the start.
So that 2-year effort was really much longer.

------
wging
See also the Adrian Colyer discussion (if you have deja vu, this is why, but
it's not a repeat of that post):
[https://news.ycombinator.com/item?id=21460759](https://news.ycombinator.com/item?id=21460759)

------
tyingq
I found the story of how Backblaze stores things "on top of", versus "in"
filesystems similarly interesting.
[https://www.backblaze.com/blog/vault-cloud-storage-architect...](https://www.backblaze.com/blog/vault-cloud-storage-architecture/)

It seems like they could go one further and eliminate the ext4 underneath.

~~~
ignoramous
You'd love this:
[https://maisonbisson.com/post/object-storage-prior-art-and-l...](https://maisonbisson.com/post/object-storage-prior-art-and-lit-review/)

Talks about Facebook, Instagram, S3 and other Object Store services and how
they deal with storage at scale.

~~~
zod50
thanks for the link!

------
Mave83
Ceph is awesome; even years ago it was a great technology. We at croit.io
provide free software to manage Ceph with ease.

~~~
core-questions
Looks slick. You got downvoted because you dared to promote something, but
this actually looks like a reasonable value-add. Ceph is one of those things
that's just a bit risky for orgs without the subject-specific expertise.

------
Ericson2314
Yes, it really cannot be emphasized enough that the legacy filesystem
interface, with its too-simple 1970s origin and then far, far, far too complex
decades of duct tape, is a disastrous albatross.

Cf. what Linus is saying in
[https://news.ycombinator.com/item?id=21673372](https://news.ycombinator.com/item?id=21673372)
except turn it around. When an interface has devolved into two sides hating
and Postel's-law-enabling each other ad infinitum, and a statement like his is
actually justifiable, it's time to close up shop and move on. Nothing good
will ever come from POSIX-like storage ever again, and any storage system
built around it is doomed to be a mess of too many layers and also too many
layer violations. Utter hopelessness.

~~~
bsder
> Yes, it really cannot be emphasized enough that the legacy filesystem
> interface, with its too-simple 1970s origin and then far, far, far too
> complex decades of duct tape, is a disastrous albatross.

Except that nobody will sign on.

Look at what happened to FreeBSD in the 5.0 timeframe when they reworked their
storage layers into GEOM. It was a _NIGHTMARE_. Most people agreed it needed
to be done, but there was an _excruciatingly_ loud segment who complained
incessantly. It took some gigantic brass balls and asbestos-lined flamesuits
on the part of FreeBSD heavy hitters to drive it through.

If the system in Linux is to get fixed, _Linus_ would probably have to step in
and pronounce.

~~~
mpweiher
Maybe the OS is not the right layer for this?

~~~
mbreese
What other layer could it be in? (Legitimately curious)

~~~
valenciarose
In userland, for one. Or an unprivileged service in a microkernel O/S. There
are a lot of concerns jammed into the current concept of filesystem.

------
pdimitar
This all makes me grateful that I use sqlite3 instead of FS for storage, even
for fairly trivial projects.

~~~
limomium
Could you expand a little on how you're doing that?

I've been thinking about transitioning entirely to sqlite for all my data.

~~~
zippie
The GP may seem like sarcasm to some ... but sqlite is an overlooked, novel way
to store things, and for small blobs it can be faster than the filesystem (up
to 35%!) [0].

You can use something like libsqlfs [1] for POSIX file semantics with sqlite
as the backing store.

For HA, one way to run sqlite in a single-primary or multi-master setup may be
DRBD.

[0]
[https://www.sqlite.org/fasterthanfs.html](https://www.sqlite.org/fasterthanfs.html)
[1]
[https://github.com/guardianproject/libsqlfs](https://github.com/guardianproject/libsqlfs)
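
To make the "sqlite instead of files" idea concrete, here is a minimal sketch
using Python's built-in sqlite3 module. It is not the GP's actual setup and not
how libsqlfs works internally; the `files` table and the `open_store`, `put`,
`get`, and `names` helpers are made up for illustration.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) a tiny name -> blob store backed by SQLite.

    The table layout is a hypothetical example, not a standard schema.
    """
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS files ("
        " name TEXT PRIMARY KEY,"
        " data BLOB NOT NULL)"
    )
    return conn


def put(conn, name, data):
    # 'with conn' wraps the write in a transaction:
    # commit on success, rollback if an exception is raised.
    with conn:
        conn.execute(
            "INSERT OR REPLACE INTO files (name, data) VALUES (?, ?)",
            (name, data),
        )


def get(conn, name):
    # Returns the stored bytes, or None if there is no such name.
    row = conn.execute(
        "SELECT data FROM files WHERE name = ?", (name,)
    ).fetchone()
    return None if row is None else row[0]


def names(conn):
    # List stored names in sorted order, a bit like a directory listing.
    return [r[0] for r in conn.execute("SELECT name FROM files ORDER BY name")]
```

Pass a real path instead of ":memory:" to keep the data on disk. Each `put`
replaces the whole blob atomically, which is something you have to build by
hand (write to a temp file, fsync, rename) on a plain filesystem.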

------
alexnewman
Brilliant, but I wonder, now that we know how filesystems work, if we could
redesign Ceph to do the right thing. For instance, a lot of work went into
scheduling the important writes at the right time. Perhaps they could have
handled these latency issues explicitly.

------
kzrdude
Was it Ceph they were using at CERN (ATLAS project, at least)? They were using
some kind of file system federation.

~~~
tyingq
Googling isn't much help. You can find references to AFS, DFS, VM-FS, and
EOS...all being used at CERN.

~~~
kzrdude
Oh, I see. I was thinking of AFS, I'm sure; I thought it was built on Ceph or
vice versa.

------
ddtaylor
Hi! This could have been submitted as an HTTPS link.

~~~
3fe9a03ccd14ca5
I assume if the owner of the site wanted to redirect all http->https traffic
they would do so.

~~~
ddtaylor
Doing that doesn't actually solve the problem though. A MITM attacker still
gets to read and modify all that content.

~~~
tyingq
I don't believe blogspot allows you to turn off http, just (optionally)
redirect it to https.

------
MichaelMoser123
But ceph has the bluestore backend that doesn't go through the file system.

~~~
mbreese
That was the punchline of the post... file systems add too much extra
overhead, so they wrote a storage backend without the file system.

------
austincheney
I am solving for this problem right now and my solution is working great
cross-OS. The file system is not the files contained by that system. The thing
that affects performance is the CPU time for compression.

------
ragerino
MapR-FS is a great distributed file system, which solves tons of challenges
out of the box, e.g. HA, POSIX, NFS, multi-tenancy, multi-temperature,
co-location, and security.

