
TFS: A file system built for performance, space efficiency, and scalability - vog
https://github.com/redox-os/tfs
======
dap
I think one of ZFS's most significant contributions was embracing the specific
ways in which disks and HBAs often fail and then building mechanisms to ensure
data integrity in the face of those failures. Bit rot and phantom writes are
the filesystem's problem, even though they're not the filesystem's fault. ZFS
did a lot of work to ensure that integrity: storing checksums in parent
blocks, storing metadata redundantly, fixing bad copies with the good ones
when corruption is detected, and scrubbing. In many filesystems still in use
today, applications can easily receive garbage data from the system.

I understand this filesystem is still nascent, but shouldn't data integrity at
least be one of the design goals?

~~~
ticki_
> I understand this filesystem is still nascent, but shouldn't data integrity
> at least be one of the design goals?

What makes you think it isn't? It definitely is. In fact, it borrows several
ideas from ZFS with respect to integrity.

For example, it uses parent block checksums like ZFS.
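For readers unfamiliar with the technique, the parent-block-checksum idea can be sketched in a few lines of Rust. This is only an illustration (the types and the stand-in FNV-1a checksum are invented here; neither ZFS nor TFS uses exactly this layout): a parent block records a checksum for each child, so corruption in a child is detected on read, and the checksums themselves are protected one level further up the tree.

```rust
// Illustrative sketch of parent-block checksums, not an actual on-disk
// format: the parent stores one checksum per child block, so a corrupt
// child is caught when it is read back.

/// Stand-in checksum (FNV-1a). A real filesystem would use a stronger
/// hash such as SHA-256 or fletcher4.
fn checksum(data: &[u8]) -> u64 {
    data.iter().fold(0xcbf29ce484222325u64, |h, &b| {
        (h ^ b as u64).wrapping_mul(0x100000001b3)
    })
}

struct Parent {
    child_sums: Vec<u64>, // one checksum per child block
}

/// Verify child block `i` against the checksum recorded in the parent.
fn read_child(parent: &Parent, i: usize, block: &[u8]) -> Result<(), &'static str> {
    if checksum(block) == parent.child_sums[i] {
        Ok(())
    } else {
        Err("checksum mismatch: block is corrupt")
    }
}
```

ZFS pairs this detection with redundant copies, so a failed verification can be repaired from a good replica rather than merely reported.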

~~~
dap
> What makes you think it isn't?

The first section in the README is called "Design goals", with 13 items. None
of them is "data integrity", and none of them even talks about validating the
data or handling any failures aside from power loss.

By contrast, in the canonical slide deck on ZFS[1], the first slide talks
about "provable end-to-end data integrity". In the paper[2], "design
principles" section 2.6 is "error detection and correction".

I'm glad to hear that's also a focus for TFS. With ZFS, the emphasis on data
integrity resulted in significant architectural choices -- I'm not sure it's
something that can just be bolted on later. As a reader, I wouldn't have
assumed TFS had the same emphasis. I think it's pretty valuable to spell this
out early and clearly, with details, because it's actually quite a
differentiator compared with most other systems.

[1]
[https://wiki.illumos.org/download/attachments/1146951/zfs_la...](https://wiki.illumos.org/download/attachments/1146951/zfs_last.pdf)

[2]
[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.184...](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.184.3704&rep=rep1&type=pdf)

~~~
ticki_
> The first section in the README is called "Design goals", with 13 items.
> None of them is "data integrity", and none of them even talks about
> validating the data or handling any failures aside from power loss.

Fair enough.

------
Nokinside
Good luck. There is a long road ahead.

Both ZFS and Btrfs were initially developed by really high-caliber people,
experts with good track records. ZFS had five years of full-time development
before release, and the next five years to get close to the features and
stability it has now. Btrfs started 10 years ago and it's still trying to
catch up.

~~~
bluejekyll
Has anyone done an analysis of the number of logic bugs in ZFS and btrfs vs.
bugs from memory-safety issues, concurrent updates to memory, etc.?

Also, I'll point out that Apple just dropped a new FS on millions of devices,
with no issues... that was developed in 3 or 4 years. I'm still blown away
that they pulled that off.

~~~
pas
They have full control of that environment though. So they tested it on a few
hundred devices, and by that exhausted all the possible configurations. And
when it all worked, they knew they had a pretty good indicator of successful
deployment on all those millions of devices.

~~~
bluejekyll
Why downplay it?

It's an impressive feat, regardless of the differences in target devices. Even
with the hardware configurations well known, the fact that it was done at such
a large scale successfully means that even unusual edge conditions didn't crop
up.

This shouldn't be downplayed, it actually speaks to why it's important to have
incremental stages of software delivery. First target highly constrained
environments (iOS, Watch OS, tvOS), then work on the more difficult and less
constrained general computing environment.

~~~
dom0
It's absolutely an impressive feat to pull this off. But it's not quite the
same problem as building a robust general-purpose FS for a diverse ecosystem.

~~~
derefr
True; and yet, it feels like there is much low-hanging fruit left in
filesystems that _are_ just built for specific vertically-integrated use-
cases. A NAS hardware-appliance company, for example, could likely pull off
something similar to what Apple did, and to great benefit.

~~~
fulafel
NetApp has been making good money with custom NAS boxes, complete with their
own OS and filesystem, since 1993 or so.

------
rsync
I searched the github page and this HN comment thread for the string "frag"
and got nothing ...

I don't know if the authors are here, but if they are - would you comment on
fragmentation and the dangers of growing a filesystem past 95-98% full?

In the world of ZFS, performance can become significantly degraded with as low
as 90% space filled. Further, our experience has been that you can
_permanently degrade_ filesystem performance by churning the usage above 95%
for any significant amount of time. Which is to say, _even if you reduce usage
back down to 80%, the zpool maintains poor performance until it is destroyed
and recreated_.

This is exactly what you would expect to see with a fragmenting filesystem
_that has no defrag tool_.

Unfortunately, creating a defrag tool for ZFS is a very daunting technical
hurdle and it appears that nobody is interested in pursuing it.

How does TFS behave? Does it have, or do you plan for it to have, a defrag
utility?

~~~
ticki_
Author here.

> I don't know if the authors are here, but if they are - would you comment on
> fragmentation and the dangers of growing a filesystem past 95-98% full?

Fragmentation isn't an issue in TFS at all, because it is a cluster-based
file system. Essentially, that means files aren't stored contiguously, but
instead in small chunks. Allocation is done entirely on the basis of
unrolled freelists.

This does cause a slight space overhead (only slight, because the file's
metadata is stored in full form), but it completely eliminates any
fragmentation.
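For readers unfamiliar with the structure, here is a minimal in-memory sketch of an unrolled freelist in Rust. Everything in it (the `NODE_CAP` of 4, the types) is invented for illustration and is not TFS's actual on-disk format; the point is that each node batches many free-cluster addresses, so allocation and deallocation usually touch a single node instead of walking a long pointer chain.

```rust
// A generic unrolled freelist, sketched in memory. Each node holds up
// to NODE_CAP free-cluster addresses plus a link to the next node.
const NODE_CAP: usize = 4;

struct Node {
    free: Vec<u64>,          // addresses of free clusters held by this node
    next: Option<Box<Node>>, // link to the next node of the list
}

struct Freelist {
    head: Option<Box<Node>>,
}

impl Freelist {
    fn new() -> Self {
        Freelist { head: None }
    }

    /// Hand out a free cluster address, or None if no space is left.
    fn alloc(&mut self) -> Option<u64> {
        while let Some(mut node) = self.head.take() {
            if let Some(addr) = node.free.pop() {
                self.head = Some(node); // node still has entries: keep it
                return Some(addr);
            }
            self.head = node.next.take(); // node exhausted: unlink it
        }
        None
    }

    /// Record `addr` as free, starting a fresh node when the head is full.
    fn free(&mut self, addr: u64) {
        if let Some(head) = self.head.as_mut() {
            if head.free.len() < NODE_CAP {
                head.free.push(addr);
                return;
            }
        }
        // Head missing or full: start a new node pointing at the old head.
        let next = self.head.take();
        self.head = Some(Box::new(Node { free: vec![addr], next }));
    }
}
```

An on-disk variant would store each node in a free cluster itself, so the list costs no extra space beyond the clusters it tracks.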

~~~
gpm
I only have a basic understanding of harddisks/filesystems, but won't that
slow down reading/writing on harddisks since the chunks won't be in order and
close together?

~~~
ticki_
With modern hard disks, no. They work in sectors.

~~~
amadvance
I suppose you mean SSDs.

Rotating disks are surely affected by non-contiguous operations, even when
done with sector granularity.

------
zx2c4
While there are many good arguments to be made against AES in favor of ARX
construction ciphers, the choice of SPECK for this is not okay. The correct
choice of an ARX cipher would have been something like ChaCha20 or Salsa20.

~~~
api
At the risk of discussion hijack, what are these arguments? Any links?

(I mean on ARX generally. Agree about Speck.)

~~~
fpgaminer
Compared to AES, ARX ciphers:

1) Are built from constant time operations, which means they are naturally
resistant to side channel attacks (timing, cache, power, etc).

2) Are far simpler in their construction. This makes them easier to reason
about and analyze.

3) Related to #2, this also makes them really easy to implement, which means
less likelihood of some coding mistake.

Beyond that, most recent ARX ciphers also have a few other advantages over
AES. For example, Threefish has a built-in tweak field, which makes using it
infinitely easier in practice.

EDIT: In case you're hungry for more detailed explanations, I highly recommend
reading the papers for Salsa/Chacha and Threefish. They're very well written,
easy to understand even if you don't have a lot of experience with
cryptography, and they have sections that explain the design decisions in
enlightening detail.
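Points 1 and 3 are easy to see in code. The ChaCha quarter round below is the entire toolbox of an ARX cipher: 32-bit addition, rotation, and xor, each of which runs in constant time on virtually every CPU. This is a sketch of the one building block, not a full cipher:

```rust
/// The ChaCha quarter round: nothing but 32-bit add, rotate, and xor.
/// No table lookups and no data-dependent branches, which is the
/// side-channel-resistance argument for ARX designs.
fn quarter_round(a: &mut u32, b: &mut u32, c: &mut u32, d: &mut u32) {
    *a = a.wrapping_add(*b); *d ^= *a; *d = d.rotate_left(16);
    *c = c.wrapping_add(*d); *b ^= *c; *b = b.rotate_left(12);
    *a = a.wrapping_add(*b); *d ^= *a; *d = d.rotate_left(8);
    *c = c.wrapping_add(*d); *b ^= *c; *b = b.rotate_left(7);
}
```

The full ChaCha20 core is just this function applied to a 16-word state in a fixed pattern, 20 rounds deep; the quarter-round test vector in RFC 8439 §2.1.1 makes an implementation easy to check.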

~~~
dom0
ARX constructions are also easier to tune for high software performance, and
generally don't require special hardware support, because all CPUs already
have fast ARX operations built in.

~~~
tptacek
That's true, but:

* The CPUs that have better-than-ARX (like, fast constant time multiplication) can do better than ARX

* Fast ARX ciphers are still slower than Intel AES hardware.

I like Salsa/ChaCha more than AES, but there's a reason AES is so popular, and
it's not incompetence or a conspiracy.

------
inlineint
I wonder would it be possible to use it somehow with Linux (in kernel space,
not with fuse, because fuse has work slower because of necessary context
switches from kernel to user space). I mean it is interesting can a wrapper
kernel module be written for interfacing with Rust code, or there are some
obstacles that would prevent from doing it efficiently.

~~~
deno
[https://github.com/tsgates/rust.ko#kernel-api-and-abi-
stabil...](https://github.com/tsgates/rust.ko#kernel-api-and-abi-stability)

~~~
djsumdog
The unstable kernel ABI is really interesting. It's one of the reasons we see
all these shims between proprietary drivers and the kernel (nvidia/amd) -
that, and licensing, of course.

~~~
hpcjoe
I know this is true on the nvidia side, but I think it is less true on the AMD
side [0]. The old AMD drivers may have been like this, but it appears they
have changed.

[0]
[https://github.com/RadeonOpenCompute/ROCm](https://github.com/RadeonOpenCompute/ROCm)

------
ysleepy
What cipher block mode is used with SPECK? Block-based disk encryption is
complicated to get right due to replay attacks of blocks over time and IV
issues. There are established best-practice compromises for AES, but I don't
know whether they apply to other block ciphers, and I doubt they are tested.

~~~
tptacek
XEX, unfortunately. That's a mistake. Unauthenticated tweakable wide-block
cipher modes are designed for simulated hardware disk encryption. That's not
what an encrypted filesystem is: a filesystem knows where files begin and end,
and has space for metadata. Filesystem encryption should use authenticated
encryption.

------
legulere
I do not buy the argument against AES. On both ARM and x86 you have side-
channel secure hardware implementations of AES.

~~~
nightcracker
ChaCha20 would be the right choice here, not SPECK.

~~~
jedisct1
Or if they really want a lightweight cipher with a small block size, they
should consider SPARX, especially since a Rust implementation is readily
available: [https://github.com/jedisct1/rust-
sparx](https://github.com/jedisct1/rust-sparx)

~~~
jstarks
Has anyone analyzed SPARX other than the authors? It seems a little early to
recommend its use.

------
grub5000
> Improved caching: TFS puts a lot of effort into caching the disk to speed
> up disk accesses. It uses machine learning to learn patterns and predict
> future uses to reduce the number of cache misses.

See, that does sound like a good idea - I've always observed HDDs with an SSD
cache to have a phenomenally useless caching system.

~~~
primer42
I'm not a big fan of putting machine learning into a file system. Usually you
can't understand why a machine learning algorithm is doing what it's doing,
and I would be worried about a production server suddenly having massively
different performance because the cache learning algorithm started doing
something differently. Interesting idea, but I would want to battle test it
before I bought in.

~~~
paulddraper
It's already impossible to tell what's going on nowadays with optimizing
compilers, memory overcommitment, hardware branch prediction, power-saving
measures, etc.

------
snvzz
This can't be seen as anything other than a research filesystem that tries
several new things at once. Doomed from the start.

A good design would look at the state of the art and use the best techniques
available. If the aim was research, then try one new thing, not a thousand.

For genuinely promising new filesystem efforts, I'd be looking at HAMMER2,
Tux3, and F2FS.

~~~
ticki_
> A good design would look at the state of the art and use the best techniques
> available. If the aim was research, then try one new thing, not a thousand.

That's what it does: it takes ideas from many sources (though mainly ZFS).

------
edward_rolf
> The system will never enter an inconsistent state (unless there is hardware
> failure), meaning that unexpected power-off won't ever damage the system.

What is the difference here between a hardware failure and an unexpected
power failure?

~~~
notacoward
A power-off is a simple fault, with fairly well-defined effects. It's actually
one of the easiest cases for a data-storage system to deal with. "Hardware
failure" includes all manner of crazy Byzantine faults, many of which are
literally impossible to deal with. What this is saying is that TFS's model of
the underlying hardware is that it will either execute all writes correctly or
stop executing any.

Sadly, a lot of hardware has much more complicated behavior than that. A lot
of RAID cards in particular will lie about what was actually written, so in a
power-off scenario later writes might have made it while earlier ones didn't,
writes can be incomplete, etc. I don't mean this as a knock against TFS. It's
more of a suggestion that the fault model be expanded to include at least a
few more possibilities.

------
desdiv
Is the choice of SPECK a serious decision, or is it some kind of
political/satire/parody thing?

~~~
gpm
I'm fairly sure it's serious, though if you look at the specification, it's
made to be easily swappable. The cipher is specified by a 'vdev', a layer on
top of the core filesystem identified by a 16-bit int. You could easily add
another for ChaCha20 and use that instead (or as well).
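If the cipher really is selected by a 16-bit identifier in a vdev layer, swapping one in could look roughly like this. The names and numeric identifiers below are hypothetical, invented for illustration rather than taken from the TFS specification:

```rust
// Hypothetical sketch of dispatching on a 16-bit cipher identifier the
// way a swappable vdev layer might. Identifiers and variant names are
// made up; they are not from the TFS spec.
#[derive(Debug, PartialEq)]
enum Cipher {
    Identity, // no encryption
    Speck128,
    ChaCha20,
}

fn cipher_from_id(id: u16) -> Option<Cipher> {
    match id {
        0 => Some(Cipher::Identity),
        1 => Some(Cipher::Speck128),
        2 => Some(Cipher::ChaCha20), // adding a cipher = adding an arm
        _ => None,                   // unknown id: refuse to mount
    }
}
```

The nice property of such a scheme is that an implementation can reject an image encrypted with an identifier it doesn't recognize, instead of silently misreading it.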

------
Handgemenge
The InterPlanetary File System [1] comes to mind. Not directly comparable to
TFS, but its feature set could inspire a local file system, too:

- historic versioning (like in git)

- deduplication

- authenticity through cryptographic hashing, as in a block chain

- distributed delivery, like BitTorrent

I don't think we need another ZFS or btrfs competitor. Just my 2 cents.

[1] [https://ipfs.io/](https://ipfs.io/)

------
equalunique
Is this the TFS that Redox OS is building after falling short on their ZFS
goal?

~~~
ticki_
TFS was created to speed up development. The issue is that following the ZFS
design specs makes implementation much slower and prevents "natural"
development (you cannot build it up like a tower; you need every component
before completion). It was started[1] and got far enough to read images, but
implementing it took ages, so we decided to put it off for now.

It is very similar to ZFS.

[1] [https://github.com/ticki/zfs](https://github.com/ticki/zfs)

~~~
XorNot
This doesn't quite seem to follow. ZFS's pool model has supported feature
flags for a very long time - isn't the issue more that to do the things ZFS
does you _need_ to implement all the other components? And since you're
planning to do a lot of what ZFS does...

------
agumonkey
IIRC disk controllers have been using "ML" for quite a while already.

------
visarga
Seems exciting, but popularizing a new file system is extremely hard.

------
je42
looks like a one man/woman show. the graphs show most of the activity only by
one contributor. :(

~~~
ovao
I don't consider that a bad thing. The project may pick up more contributors
down the road — or it might not. Either way, that doesn't speak negatively of
the project itself.

~~~
je42
yes, true. But for a file system you either need a lot of time or more people.

~~~
ovao
I certainly agree. I think there are compelling arguments both for keeping the
work to a very small handful of people and for spreading the work across many
contributors. It seems to be early days for TFS, but so far it looks like an
impressive bit of work.

------
tracker1
"Team Foundation Server" (Microsoft's version control server) is what goes
through my mind... poor choice of naming...

~~~
deno
So you have to add “fs” or “filesystem” to your search queries, or not even
that if the rest of the query gives sufficient context.

Everyone makes name conflicts in independent domains out to be a much bigger
problem than they are in reality.

There’s Amazon rainforest, Amazon the ecommerce website, and Amazon the cloud
company. How often do you have any problem differentiating between them?

~~~
tracker1
Well... let's google it... "TFS"... hmm, an abridged lead-in to the Wikipedia
article at the top... And though I'm not a systems programmer likely to
implement or support the code for a filesystem, I am _a_ programmer, and I
work in IT... And systems operators are also likely to come across TFS (Team
Foundation Server) in terms of supporting a deployment of it.

Though they've started to refer to the source control protocol implementation
as TFVC, since TFS supports git as well now. It does seem to have some
conflicts, and Microsoft even notes another file system called TFS
themselves.

In this case, I'm pretty sure another name might be a better idea. Hell, TFS
the version control system and the other file system are better known than
Firebird the database was when Mozilla renamed their shiny new browser.

~~~
deno
If you’re looking for Java and the first result is the Wikipedia article on
the island[1] do you just switch programming languages?

I hear C is very search engine friendly. Probably why it’s so popular.

[1] [https://en.wikipedia.org/wiki/Java](https://en.wikipedia.org/wiki/Java)

~~~
tracker1
The point is, there's already a prevalent technology in use by the same
name... I wouldn't really expect anyone to choose a programming language
based on google-ability... Go is to this day hard to search for on its own,
"golang" being better.

That said, I wouldn't expect a "new" programming language called "Java-Script"
(not the current ES/JavaScript) to gain traction. Or, for that matter, a
programming language called "Coffee" to be very successful either.

~~~
obstinate
You're tilting at windmills. Most threads where someone starts something new
have someone making complaints similar to the one you're making here. Yet
people's behavior doesn't change. And it won't -- it's just too much overhead
to avoid the ever-expanding space of products that have the same name.

~~~
tracker1
Yes, but there are at least two other filesystems called TFS as well, per
other threads... so even then, it's still overloaded. If I were releasing
something to the public, I would probably namespace it or consider something
different in this case.

~~~
deno
I did. It’s still not synonymous with “popular.”

------
hestefisk
Good luck developing something better and more stable than zfs...

