
Rethinking Files - UkiahSmith
https://www.devever.net/~hl/objectworld
======
erichanson
Nice to see folks rethinking files, as they're a scourge on the planet and an
antiquated anti-pattern that has been holding back the industry pretty much
since its inception. I don't know how anyone could take a look at /etc, for
example, and consider it anything but archaic. The adduser command is some
1130 lines long, and all it does is CRUD on files, to name just one example.
Then there are countless config files that just have to be edited by
hand and happily accept syntax errors and logical errors. No modern system
would tolerate this.

The root of the problem with files is that they lack an information model,
beyond just a sequence of bytes. They are unopinionated to a fault. All files
have structure. Even if that structure is a "non-structure" like "all these
files are just a random sequence of meaningless bytes", then that is their
structure. But this information isn't present in the system, nor can it be
enforced or constrained when that is desirable.

To me, the obvious alternative is the database, aka "everything is a row". We
have used the database (relational or otherwise, but mostly relational) to
successfully model many, many domains, and to bring coherence and clarity to them.
The cool thing about the relational database is that it's based on an
underlying relational algebra. The syntax of data in an RDBMS is really just
one manifestation of a deeper layer of structure that is syntax-free, and
these abstract structures can be (and are) manifested in multiple coexisting
syntaxes.
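
A toy sketch of the "everything is a row" idea, using Python's built-in sqlite3 (the table and column names here are made up for illustration): instead of adduser hand-editing files, accounts live in a table whose constraints reject malformed entries up front.

```python
import sqlite3

# User accounts as rows: the schema carries the information model,
# and constraints reject bad data at write time.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE users (
        uid   INTEGER PRIMARY KEY CHECK (uid >= 0),
        name  TEXT NOT NULL UNIQUE,
        shell TEXT NOT NULL DEFAULT '/bin/sh'
    )
""")
conn.execute("INSERT INTO users (uid, name) VALUES (?, ?)", (1000, "alice"))

# A logically invalid "entry" is refused, rather than silently accepted
# the way a hand-edited config file would accept it.
rejected = False
try:
    conn.execute("INSERT INTO users (uid, name) VALUES (?, ?)", (-1, "bob"))
except sqlite3.IntegrityError:
    rejected = True
```

Contrast this with /etc/passwd, which will happily store a negative uid until something downstream chokes on it.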

I'm exploring this pattern ("datafication", headshake) with Aquameta
([http://aquameta.org/](http://aquameta.org/)) and have written a lot more about
why file-centric computing is holding us back
([http://blog.aquameta.com/intro-chpater2-filesystem/](http://blog.aquameta.com/intro-chpater2-filesystem/)).
Boot to PostgreSQL! :)

~~~
gjs278
Files in /etc can be modified by any text editor. It sounds like you are
advocating for a system similar to the Windows registry: it can easily corrupt
and can't be fixed even by live CDs or other operating systems that have the
filesystem driver. It would be a massive step backwards.

~~~
erichanson
I agree the system you describe would be a massive step backwards. :)

No, I'm saying the OS needs an information model as a first-class citizen. But
since you brought up data corruption: This hypothetical OS could also benefit
from having transaction support up in application space -- to avoid data
corruption -- something most "modern" programs don't even have, even though
most file systems do.
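
A minimal sketch of what application-space transactions buy you, again using Python's sqlite3 (the config table is hypothetical): a multi-step update either fully applies or fully rolls back, so a crash mid-update can't leave half-written state the way rewriting a plain file can.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE config (key TEXT PRIMARY KEY, value TEXT)")
conn.execute("INSERT INTO config VALUES ('dns', '1.1.1.1')")
conn.commit()

try:
    with conn:  # opens a transaction; rolls back on exception
        conn.execute("UPDATE config SET value = '8.8.8.8' WHERE key = 'dns'")
        raise RuntimeError("simulated crash mid-update")
except RuntimeError:
    pass

# The half-finished update was rolled back; the old value survives.
value = conn.execute("SELECT value FROM config WHERE key='dns'").fetchone()[0]
```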

To be fair, I do think the database needs a way to edit the contents of a
field using your favorite text editor. But we've got a pgfs plugin that uses
FUSE to make the database accessible as a filesystem as well.

------
zokier
> People like Unix's “everything is a file” approach because what it really
> means is “everything is exposed to the same nexus”. It means you need only
> ssh to a system and you have all the power to reshape all aspects of that
> system with a single interface, the command line, using a common set of
> highly composable tools

But at least in Linux there are a ton of files that are _not_ exposed to the
"same nexus", i.e. the filesystem. The most common example would be network
sockets. They are files, but they do not exist anywhere in the filesystem. In
Linux, a file is more of an object handle.

[https://yarchive.net/comp/linux/everything_is_file.html](https://yarchive.net/comp/linux/everything_is_file.html)

[http://events17.linuxfoundation.org/sites/events/files/slide...](http://events17.linuxfoundation.org/sites/events/files/slides/fd_0.pdf)

~~~
delish
Yes, people call it the "Unix philosophy" not the "Linux philosophy." And the
Plan 9 folks (nee Unix folks) also thought everything in Unix should be a file
and not everything was.

[edited for clarity]

~~~
amelius
Instead of calling everything a "file", you might as well call everything an
"object" then. It also seems more correct.

~~~
xfer
Calling it an object would imply you have different methods of accessing it.
But files have a fixed set of methods/properties.

~~~
amelius
Not sure what you mean, because e.g. the ioctl() call has many functions which
are not sensible for regular files.

In any case, the article has 17 occurrences of "file(s)" versus 28 of
"object(s)", so the author seems to agree with me :)

~~~
xfer
Not every file responds to ioctl(), and that's not what people mean when they
say "Unix philosophy". Yes, in a sense ioctl() does model the object nature,
but ioctls aren't discoverable and are rather ad hoc; the article points to
PowerShell and its capability for reflection as an example of how OOP-based
resource access would look.

------
Lowkeyloki
I found the URL addressing scheme of Redox to be fascinating, if perhaps
slightly less user-friendly compared to files and file paths.

[https://doc.redox-os.org/book/design/url/urls.html](https://doc.redox-os.org/book/design/url/urls.html)

~~~
hlandau
Personally, Redox's use of URLs seemed like really bad design to me. It
doesn't get simpler than the Unix path syntax.

Having a scheme:// makes sense for URLs because you don't otherwise have any
contextual information indicating _how_ to access a resource. But this isn't
the case for something like a virtual filesystem, where the total set of
filesystems mounted under it - and their types - are all known to the system.
There's no need for disk://foo when you can just have /dev/disk/foo.

~~~
mpweiher
That's true when the namespace covers objects that are accessed in very
similar, ideally identical, ways.

If that's not the case, I have found the scheme to be helpful to indicate
what's going on.

~~~
hlandau
On *nix, you can always figure out what type of filesystem is mounted at a
given prefix by typing `mount`.

What the use of schemes does is make things needlessly inflexible, and it
embeds a dependency on the name of a filesystem provider inside consumers of
that filesystem. It's akin to a Unix where filesystems can only be mounted at
top-level directories like /mnt, but not at /mnt/foo, etc.; I don't see the
appeal.

~~~
yjftsjthsd-h
I prefer to use `df -T /path/to/mount`, personally.

~~~
vageli
> I prefer to use `df -T /path/to/mount`, personally.

Why?

~~~
yjftsjthsd-h
Lets me specify a single file system; I don't think `mount` does that (unless
I'm blind; possible).

~~~
vageli
Mount lets you do this as well.

mount -t type device destination_dir

Unless I am missing something in your use case.

------
EmilStenstrom
I recommend opening up developer tools and adding this before reading this
article:

    
    
      body {
          width: 40em;
          margin: 0 auto;
          font-size: 1.4em;
          line-height: 1.4em;
      }

~~~
saadat
Or use the Reader View/Reader mode in Firefox/Safari.

------
OJFord
What if the interaction were more like OOP - the File class wouldn't
necessarily make sense as the top parent.

Would be kind of interesting to call methods on objects rather than read/write
files, but it's not immediately obvious to me that that really gains anything
over the status quo.

And now that I've written that, I wonder: is that what PowerShell's
verb-object does anyway? I've never come close to proficient enough (nor
wanted to!) to know.

~~~
hlandau
This is pretty much what I mean. Essentially I'm proposing that the "base
class" shouldn't support anything other than close(). You could have objects
which don't support read()/write(), but custom verbs with different semantics
appropriate to their type. Tape drives (and only tape drives) could support a
rewind(), etc.
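
A rough sketch of what I mean, in Python (all class names here are hypothetical):

```python
# The "base class" supports nothing but close(). read()/write() aren't
# universal verbs; only objects whose type warrants a verb expose it.
class Handle:
    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True

class RegularFile(Handle):
    def __init__(self, data=b""):
        super().__init__()
        self.data = bytearray(data)

    def read(self, n):
        out, self.data = self.data[:n], self.data[n:]
        return bytes(out)

class TapeDrive(Handle):
    def rewind(self):  # a verb only tape drives carry
        self.position = 0

tape = TapeDrive()
tape.rewind()
has_read = hasattr(tape, "read")  # False: no universal read() verb
```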

~~~
mbreese
I think it might be more helpful to think of what semantics could be supported
by the everything-is-a-file operations. A tape drive rewind() is just a
specialized version of seek(), which any random access object would need to
support.

The file metaphor is soooo flexible, so it’s hard for me to think of examples
where it breaks down. So, what are some good examples where the file metaphor
breaks down? Maybe that’s helpful?

~~~
hlandau
The trouble is, a tape drive can support seek(). But can it support it
performantly for all arguments? seek(0, SEEK_SET) is easy. seek(1024,
SEEK_CUR) is easy - just read forward a little. But seeking to some arbitrary
fixed offset? As far as I'm aware, tape drives are designed to use filemarks
for 'searching', not precise offsets.

Of course seek(n, SEEK_SET) could be implemented anyway, in a very
unperformant (and tape-wearing) manner: by rewinding and then reading forward
n bytes. There's a question of whether the utility of this is desirable when
weighed against how surprising it may be to people who don't realise just how
bad the performance will be, especially when a tape drive which only supports
seek(0, SEEK_SET) can easily have this behaviour emulated on top in userspace
by seek(0, SEEK_SET) followed by dummy reads, if you really want it.
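
That userspace emulation might look like this in Python, with `RewindOnlyDevice` standing in for a real tape driver (it is not a real API):

```python
# A device that only supports rewind (seek to 0) plus sequential reads.
class RewindOnlyDevice:
    def __init__(self, data):
        self._data = data
        self._pos = 0

    def rewind(self):
        self._pos = 0

    def read(self, n):
        out = self._data[self._pos:self._pos + n]
        self._pos += len(out)
        return out

def seek_set(dev, offset, chunk=4096):
    """Emulate seek(offset, SEEK_SET): rewind, then discard dummy reads."""
    dev.rewind()
    remaining = offset
    while remaining > 0:
        skipped = dev.read(min(chunk, remaining))
        if not skipped:
            break  # offset beyond end of medium
        remaining -= len(skipped)

dev = RewindOnlyDevice(b"0123456789")
seek_set(dev, 4)
tail = dev.read(3)
```

Correct, but on a real tape every such seek means physically shuttling the whole reel, which is exactly the surprise argued about above.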

read() and write() and seek() prove remarkably versatile, but the niggles come
with the fact that different types of file/device on POSIX can have subtly
different gotchas with these verbs which, on the face of it, appear to be the
same verb. Essentially, I might argue they're not the same verb at all - they
just seem similar.

For example, read() from a UDP socket and read() from a normal file have
extremely different semantics. If you read() with a 64 byte buffer from a UDP
socket, the message is truncated and the remainder of it is lost. This is a
very different semantic to reading from a file, where you can read in whatever
chunks you like.
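
This semantic gap is easy to demonstrate with Python's socket module:

```python
import socket

# A 200-byte datagram read with a 64-byte buffer: the datagram is
# truncated and the remaining 136 bytes are silently discarded. A
# regular file read of 64 bytes would just leave the rest for the
# next read().
recv_sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
recv_sock.bind(("127.0.0.1", 0))
send_sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
send_sock.sendto(b"x" * 200, recv_sock.getsockname())

chunk = recv_sock.recv(64)  # rest of the datagram is gone
got = len(chunk)

recv_sock.close()
send_sock.close()
```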

I wrote the article upon reflecting on precisely this attempt to force
everything into the straitjacket of everything-is-a-file that we've had for
decades with UNIX. How much code correctly deals with short write()s?
"Everything can be expressed as an object on which you can perform
read()/write()" can only be true if you ignore the details of a verb's precise
semantics, but the precise semantics are important. I think it's fair to argue
that write() isn't one verb at this point, but an overloaded verb referring to
a set of verbs. Which verb in that set you're invoking is dependent on the
type of "file".
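
For instance, the write-all loop that correct code needs around POSIX write(), sketched here in Python with os.write():

```python
import os

def write_all(fd, data):
    """Keep writing until all of `data` is accepted; os.write() on a
    pipe or socket may accept fewer bytes than asked (a short write)."""
    view = memoryview(data)
    total = 0
    while total < len(data):
        n = os.write(fd, view[total:])
        total += n
    return total

r, w = os.pipe()
written = write_all(w, b"hello")
os.close(w)
payload = os.read(r, 1024)
os.close(r)
```

Plenty of real code calls write() once and assumes it all went through; that assumption happens to hold for small writes to regular files and then quietly breaks elsewhere.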

------
mpweiher
That's kind of the point of stores:

[https://github.com/mpw/MPWFoundation/blob/master/Documentati...](https://github.com/mpw/MPWFoundation/blob/master/Documentation/Stores.md)

and Polymorphic Identifiers:

[http://objective.st/URIs/](http://objective.st/URIs/)

Hierarchical paths were a good idea, let's use them. Objects were also a good
idea, let's use those. A small set of verbs (GET, PUT, POST, DELETE) was also
a good idea. Let's combine these!

Abstract from:

    
    
       Path    + File       + POSIX I/O
       URI     + Resource   + REST Verbs
    

Get:

1. Polymorphic Identifiers, which subsume paths, URIs, variables, dictionary
keys etc.

2. Stores, which resolve URIs and subsume filesystems, HTTP servers,
dictionaries, etc.

3. A small protocol that essentially mirrors REST verbs in-process

See also: In-process REST,
[https://link.springer.com/chapter/10.1007/978-1-4614-9299-3_...](https://link.springer.com/chapter/10.1007/978-1-4614-9299-3_11)

~~~
hlandau
The whole point of my article was that constraining things to a small, limited
set of verbs is actually a bad idea.

Theoretically you can just make up your own verbs for HTTP and use those. In
practice people stick to the common ones because they're well supported. This
leads to people massaging a problem domain into the straitjacket of
GET/PUT/POST/PATCH/DELETE, regardless of how well it fundamentally fits that
set of verbs. (I'm also convinced nobody actually knows what "REST" means, but
that's another rant for another time.)

~~~
k__
What other verbs could be needed?

~~~
hlandau
Let's suppose that I have an HTTP endpoint for a doorbell (for some reason).
The doorbell resource is represented as
[http://example.com/doorbell](http://example.com/doorbell).

There's no RING HTTP method, and I could invent one, but heaven knows if
various HTTP middleware would be happy with that. In practice, people do
something like

    
    
        POST http://example.com/doorbell/ring
    

The problem with this is that you now have a hierarchy of verbs; you have
first class verbs (GET, PUT, POST, PATCH, DELETE), and second class verbs
which have to be represented as distinct resources. This feels like a hack to
me.
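
For what it's worth, nothing in HTTP itself forbids the invented verb; Python's http.server, for one, will dispatch any method name. It's the middleware ecosystem that pushes people to the POST workaround. A toy demonstration (the RING verb and the doorbell handler are, of course, made up):

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

class Doorbell(BaseHTTPRequestHandler):
    def _reply(self, text):
        self.send_response(200)
        self.end_headers()
        self.wfile.write(text.encode())

    def do_RING(self):                       # invented first-class verb
        self._reply("ring via RING /doorbell")

    def do_POST(self):                       # verb demoted to a resource
        if self.path == "/doorbell/ring":
            self._reply("ring via POST /doorbell/ring")
        else:
            self.send_error(404)

    def log_message(self, *args):            # keep output quiet
        pass

server = HTTPServer(("127.0.0.1", 0), Doorbell)
threading.Thread(target=server.serve_forever, daemon=True).start()

def call(method, path):
    conn = http.client.HTTPConnection(*server.server_address)
    conn.request(method, path)
    resp = conn.getresponse()
    body = resp.read().decode()
    conn.close()
    return resp.status, body

ring_status, ring_body = call("RING", "/doorbell")
post_status, post_body = call("POST", "/doorbell/ring")
server.shutdown()
```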

~~~
k__
Ah okay.

But isn't this basically what RPC vs. REST boils down to?

As far as I know, people tried the RPC way for years, then gave up on it and
started doing REST, seemingly because inventing a whole bunch of methods was
inherently flawed.

------
syn0byte
You're not solving anything; at best you are getting maybe one extra level of
abstraction, by shifting _potential_ complexity in the application (it may or
may not care about an internal file "schema", and thus may have no code for
it) into _concrete_ complexity in the system: your application doesn't
utilize a file schema, but some applications might, so everyone gets a schema
field and there is a bunch of extra code and complexity to support it.

From a security/reliability standpoint it sounds like a nightmare, combining
the worst of things like NTFS alternate data streams and shared library
loading into one.

------
leoc
See my earlier comment,
[https://news.ycombinator.com/item?id=14542595](https://news.ycombinator.com/item?id=14542595)
.

Lotus Agenda/Chandler
[https://en.wikipedia.org/wiki/Chandler_(software)](https://en.wikipedia.org/wiki/Chandler_(software))
is another part of this long Grail quest.

------
bayareanative
Files are too finite and low-level, and they lose generation/parsing knowledge
that ends up implemented N times in N places. OSes should read and write
message-oriented streams of records (pb, capnp or similar) that are invisible
to the user, while tools and code see data and data structures. This solves
many problems of unnecessary, repeated effort: parsing log files, log file
rotation, proprietary file formats, portability, compatibility and
extensibility.
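
A minimal sketch of such a record stream, using length-prefixed JSON in Python rather than pb/capnp: the framing and parsing live in one place, and consumers never re-implement ad-hoc line splitting.

```python
import io
import json
import struct

def write_record(stream, record):
    """Write one record as a 4-byte big-endian length plus a JSON payload."""
    payload = json.dumps(record).encode()
    stream.write(struct.pack(">I", len(payload)))
    stream.write(payload)

def read_records(stream):
    """Yield records until the stream is exhausted."""
    while True:
        header = stream.read(4)
        if len(header) < 4:
            return
        (length,) = struct.unpack(">I", header)
        yield json.loads(stream.read(length))

buf = io.BytesIO()
write_record(buf, {"level": "info", "msg": "service started"})
write_record(buf, {"level": "warn", "msg": "disk 90% full"})
buf.seek(0)
records = list(read_records(buf))
```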

Also, programs should be able to dynamically serve the contents of "files" as
well, with an "activation symlink", i.e.,

    
    
        /etc/resolv ->* resolvconf
    

The "everything must be plain text" refrain is obsolete and unnecessary,
because it would be trivial to serialize anything to any format once
everything is a universally-supported data structure in both tools and code.

It's not 1978 anymore.

~~~
RcouF1uZ4gsC
Sounds a lot like WinFS
[https://en.m.wikipedia.org/wiki/WinFS](https://en.m.wikipedia.org/wiki/WinFS)

------
O_H_E
Two such projects that can help with managing files until we get another system:

TMSU - tags your files and then lets you access them through a virtual
filesystem from any other application

[https://tmsu.org](https://tmsu.org) --
[https://github.com/oniony/TMSU](https://github.com/oniony/TMSU)

Tagsistant - Semantic filesystem for Linux, with relation reasoner,
autotagging plugins and deduplication

[https://www.tagsistant.net](https://www.tagsistant.net) --
[https://github.com/StrumentiResistenti/Tagsistant](https://github.com/StrumentiResistenti/Tagsistant)

------
solidsnack9000
The examples given at the end, where verbs are commands at certain paths,
look a lot like a special file system. All the printers are under `/print`
and all the print commands are under `/print`. One could imagine all the
database tables being under `/db` and all the commands being under `/db/bin`.

------
ubrpwnzr
Another site, can we please just add something like this:

    <style xmlns="http://www.w3.org/1999/xhtml">
    body { max-width: 600px; font-family: "Calibri"; margin-left: auto; margin-right: auto; }
    </style>

------
tgbugs
I've done some silly things [0] with python's pathlib recently that seem
related to the issues discussed here. Given that smalltalk message passing
finally clicked for me during the process, I am attracted to an object-like
solution for everything (or a file-object-like solution for everything, since
the practical performance advantages are undeniable). That said there are some
considerations both for the low level implementation, and for high level
things like affordances for 'file' operations.

In direct response to the suggestion about file paths for verbs: Alan Kay
says in one (possibly many) of his talks something along the lines of 'every
function should have a URL.' One of surely many challenges is how to ensure
that the mechanisms used to populate file system paths with nested
functionality (e.g. /usr/bin/ls/all for `ls -a`) don't trigger malicious
behavior during service/capability discovery. Being able to more deeply
introspect file data and metadata as if the file were a folder could
potentially be implemented as a plugin, and I worry about the complexity of
requiring a file system to know about the contents of the files that it hosts,
or that the files themselves be required to know about how to tell the file
system about themselves. Existing file systems adhere to a fairly strict
separation of concerns, since who knows what new file format or language will
appear, and who knows what file system the file will need to exist on.

Said another way I think that the primary issue with the suggested approach is
that it is hard to extend. The file system itself needs to know about the new
type of object that it is going to represent, rather than simply acting as an
index of paths to all objects. If there is a type of object that is opaque to
the current version of the file system that object either has to implement a
file-system-specific discovery protocol (which surely would have fun security
considerations if it were anything other than a static manifest) or the user
has to wait for a new version of the file system that knows what to do with
that file type.

Some thoughts from my own work. (partially in the context of OJFord's comment
below)

Treating files and urls as objects that have identifiers, metadata, and data
portions and where the data portion is treated as a generator is very
powerful, but the affordances around the expression local_file.data =
remote_file.data make me hesitate. When assignment can trigger a
network-saturating read operation, or when the setter doesn't know anything
about how much space is on a disk, etc., then there are significant footguns
waiting to be
fired and I have already shot myself a couple of times.

The more homogeneous the object interface the better. However, this comes with
a major risk. If the underlying systems you are wrapping have different
operational semantics (think files system vs database transactions) and there
is no way to distinguish between them based solely on the interface (because
it is homogeneous) then disaster will strike at some point due to a mismatch.
To avoid this everything built on top of the object representation has to be
implemented under the assumption of the worst case possible behavior, making
it difficult to leverage the features of more advanced systems. As with the
affordances around local.data = remote.data, if I have a socket, a local file,
a remote web page that I own, a handle to an led, a handle to a stop light, a
database row in a table that has triggers set, the stdin to an open ssh
session, and a network ring buffer all represented in the same object system,
I have as many meanings for file_object.write('something') as I have types of
objects, and the consequences and side effects of calling write are so diverse
(from flipping bits on a harddrive to triggering arbitrary code execution)
that it is all but guaranteed that something will go horribly wrong. At the
very least there would need to be a distinction between operations where all
side effects could be accounted for beforehand (e.g. writing a file of known
length to disk has the side effect of reducing free disk space, but that is
known before the operation starts), and operations where the consequences will
depend on the contents of the message (e.g. DROP TABLES), with perhaps a
middle ground for cases with static side effects (e.g. the database trigger)
but that would not be immediately visible to the caller and that might change
from time to time.

The distinction between files and folders is quite annoying (non-homogeneous),
especially if you want to require that certain pieces of metadata always
'follow' a file. This is from working with xattrs, which are extremely easy to
lose if you aren't careful. Xattrs are a great engineering optimization to
make use of dead space in the file system data structure, but they aren't
quite the full abstraction one would want. It is also not entirely clear what
patterns to use when you have a file that is also a folder -- do you make the
metadata the outer file and the data the inner file? Or the other way around?
Having the metadata as the outer file means that you can change the metadata
without changing the data, but that the metadata will always 'update' when its
contents (the data) changes. However, when I first thought about using such a
system, I had it the other way around, and a system with that much flexibility
I suspect would have even more footguns than the current system.

Another issue is the long standing question around what constitutes an atomic
operation. Everything is simple if only a single well behaved program is ever
going to touch the files, but trying to build a full object-like system on top
of existing systems is a recipe for leaky abstraction nightmares.

While I was working on this I came across debates from before I was born. For
example hardlinks vs symlinks. There are real practical engineering tradeoffs
that I can't even begin to comment on because I don't understand the use cases
for hardlinks well enough to say why we didn't just get rid of them entirely.

0. [https://github.com/SciCrunch/sparc-curation/blob/master/sparcur/paths.py](https://github.com/SciCrunch/sparc-curation/blob/master/sparcur/paths.py)

~~~
hlandau
>and the consequences and side effects of calling write are so diverse (from
flipping bits on a harddrive to triggering arbitrary code execution) that it
is all but guaranteed that something will go horribly wrong.

This is why I suggest we really need the ability to dynamically add new verbs.
POSIX has one write() but in terms of semantics it's really a whole family of
verbs as one overloaded method.

>The file system itself needs to know about the new type of object that it is
going to represent, rather than simply acting as an index of paths to all
objects.

What I had in mind was that a given filesystem driver (e.g. a userspace FUSE
process) would provide object types it supports. So for example, a "printer
FS" process, printerfsd, would provide printer objects under e.g. /printer/.
But the vfs, the layer that does prefix matching on mount()ed filesystems,
wouldn't need to know about new object types, as it's just a dispatcher.

One shortcoming of this is that you can't mv /printer/foo to another
filesystem. That's also a shortcoming of e.g. today's /proc or /sys, but there
still seems to be enough that's worthwhile about this approach.
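
A toy version of that vfs-as-dispatcher in Python (the mount table entries and provider names are hypothetical): the vfs only does longest-prefix matching, and never needs to know the providers' object types.

```python
# Mount table: path prefix -> filesystem provider.
mounts = {
    "/": "rootfs",
    "/printer": "printerfsd",
    "/printer/office": "officeprintfsd",
}

def resolve(path):
    """Return (provider, remainder) for the longest matching mount prefix."""
    best = max(
        (m for m in mounts
         if path == m or path.startswith(m.rstrip("/") + "/")),
        key=len,
    )
    return mounts[best], path[len(best):].lstrip("/")

provider, rest = resolve("/printer/office/laser1")
```

Each provider then interprets the remainder however it likes; the dispatcher stays fixed while new object types appear in new providers.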

