What is a file? (2011) (microsoft.com)
89 points by polm23 on July 1, 2017 | 24 comments



This is a very wordy paper that basically just says: people want to own their pictures (and the associated (Facebook) comments/likes) and similar things, they want to know what happens when they delete them, and ownership may be the most important aspect to visibly preserve.


This is the problem when somebody creates something simple that just works: very quickly there will be a lot of other people trying to add "just a little thing" to it, and then you have a lot of complexity. It only stops when nobody can figure out what that thing has become.


Nice, but change the concept of a file, and you'll run into all sorts of interoperability problems.


Yeah but it will have to be done eventually. A lot of the pain in modern programming is trickle down from bad systems abstractions we're stuck with from Unix and its contemporaries.


Bad system abstractions? Can you elaborate on what a good abstraction other than a file would be for an operating system?


Just have persistent objects (or strongly typed values). LaTeX documents, tables, images, videos, songs, computer programs, etc. have very little in common, so why make them all files?


> LaTeX documents, tables, images, videos, songs, computer programs, etc. have very little in common, so why make them all files?

They have plenty of things in common -- you can make an independent local copy of them, you can copy them to a USB stick and give them to other people, they can be backed up to local storage or to a cloud, and they can be sent by e-mail.

Having persistent objects means every single program has to implement its own cloud syncer and "copy to USB stick" functionality. In practice, this just means heavily siloed data -- as a power user, I would hate this.


A file is an abstraction that provides consistent semantics for the operations that can be performed on an arbitrary piece of data or resource that the operating system exposes. Nothing prevents you from creating another layer of abstraction that works with your various objects.
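To make the "consistent semantics" point concrete, here is a minimal sketch in Python. The helper names (`copy_bytes`, `demo`) and the sample file names are made up for illustration; the point is only that one copy routine works on a LaTeX source and on arbitrary binary data without knowing anything about either.

```python
import os
import tempfile


def copy_bytes(src_path: str, dst_path: str, chunk_size: int = 64 * 1024) -> int:
    """Copy any file byte for byte without interpreting its contents."""
    copied = 0
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:  # empty read means end of file
                break
            dst.write(chunk)
            copied += len(chunk)
    return copied


def demo() -> dict:
    """Copy a text payload and a binary payload with the same function."""
    results = {}
    # Hypothetical sample data: a LaTeX fragment and raw bytes.
    payloads = {
        "notes.tex": b"\\documentclass{article}",
        "pixels.bin": bytes(range(256)),
    }
    with tempfile.TemporaryDirectory() as tmp:
        for name, data in payloads.items():
            src = os.path.join(tmp, name)
            dst = src + ".copy"
            with open(src, "wb") as f:
                f.write(data)
            copy_bytes(src, dst)
            with open(dst, "rb") as f:
                results[name] = (f.read() == data)
    return results
```

A typed-object store could of course offer a generic copy too; the sketch only shows what the file abstraction gives you for free today.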

I would love to see your proposal for unified semantics covering all these possible "strongly typed values" that the OS can expose.


Please read this: http://thomas.enix.org/pub/rmll2005/rmll2005-shapiro1.pdf

Because EROS processes survive system shutdown, no file system is required to provide persistence. The operating system instead implements a transactional block store with only two kinds of objects at the disk level of abstraction: pages (which hold user data) and nodes (which hold capabilities). Human naming services (directories) are provided by applications that act as directory servers. EROS does not have any notion of a “file system root directory” or a universally shared file system at all. An EROS system is simply a very large space of objects that are connected together by capabilities. Surprisingly, the resulting persistence implementation is both simpler and faster than a conventional file system.

You can actually go further, and dispense with the operating system too, by running an image-based programming language, such as Smalltalk or Lisp, on the bare metal. Device drivers and other essential functions of operating systems then become part of the language.


I read the paper, and it appears to describe a scenario where users don't even own the computer---they just rent time on a computer run by somebody else (known today as the "cloud"). Also, it seems to describe a silo of applications where moving a file across applications becomes a significant action. I'm reminded of a story of a person who, when asked to copy a file, loads up the application that made it (in this case, Microsoft Word), opens the file (from the hard drive) and saves it in a new location (the floppy drive), not even aware that one could just "copy c:file.doc a:file.doc".

It does not seem like a system I would like.

I'm also skeptical of image based languages (but I'm willing to admit I am biased towards files).


Thanks for the reference. A quick glance over that paper suggests they have a powerful capability system which, tied with the objects, provides the necessary abstraction. As such, page objects by themselves don't provide any benefits.

In that sense files can also be thought of as objects with a set of methods (walk, open, read, write, close etc). The only difference is that their object collection is a graph while a filesystem is a tree.

Historically, Unix systems have had a very weak capability system; there are research operating systems which try to address this, but you don't need to get rid of the filesystem/file as an abstraction.

I have played with image-based programming environments (Pharo, CL) in the past; I am not sure how well they will interoperate with programs written in other languages that don't share the same semantics.


Yeah, I think Microsoft did this in the 90s. Something something OLE [1] something something.

[1] https://en.wikipedia.org/wiki/Object_Linking_and_Embedding


Yup. Resource forks on classic Mac OS, for instance.


"problems that ensue when files move..." from NTFS to FAT32 never mind "...from the PC on to the networked world of today".

But as always with Microsoft: Beware geeks bearing gifts.


We are still stuck in the past:

1 - Machines use a Unix filesystem model even cruder than the Multics model it was derived from

2 - Network services and especially micro services essentially use an object model (mostly JSON)

3 - And with phones, sandboxes, containers and security concerns, applications are reverting to file silos (where a given app's data won't interoperate with other applications). This happens at a macro level too where the FBs of the world understandably want to rope you into their "walled silo".

Is it even possible to embrace a more flexible and powerful model at this point or are we stuck?


For the majority of people, the benefits of Facebook's walled garden approach are cheap, immediate and easy to understand. The disadvantages are only problems for a minority. What's more, Facebook can afford to hire a whole marketing team to promote its approach.

Until someone can show concrete benefits for the masses, we will continue in the current direction.


I believe that as long as these issues remain discussable, even if only in obscure academic and eccentric circles, at some point some ambitious person will find a solution. This might take 5 years, 20 years, 50 years or 2000 years, I don't know. But I believe in the long term.


I love the file hierarchy. Miss it on my iPhone.


Objects are linked to each other in different ways, so some kind of database (I prefer graph databases) with a proper search engine aware of different object types (instead of just grep) would be better.


> The first suggestion for a way forward is perhaps the most obvious: it entails rethinking the role of metadata. [...] But metadata is also now becoming central to what users understand as a file, though they might not always think of tags, comments, playlist information and so forth as metadata. For what a file is is now often bound up with the things added to it, not only by the originating user but by others too.

> Consider for example, behaviours reported by [5]. In their study of teenagers and their virtual possessions, participants reported that part of the value of photos posted on Facebook was the metadata associated with them: comments and ‘likes’ were so pertinent that they were sometimes printed out alongside photos and pasted onto bedroom walls as a collection. This materialisation of the digital is indicative of a difficulty associated with the current technological landscape.

> It is not clear how one would digitally export a Facebook photo in order to view it in this way with another computer program or application, and this remains so despite recent innovations in the Facebook service. Yet it is not surprising that users should want to treat these entities in the way they treat a file. If they can upload their photos to Facebook, and given that they do so the photos are file-like objects, why can they not download them again, while retaining the value they have accrued, but still with the benefits of file-like properties? Although it is now easier for users to export their data from Facebook, these exports, once represented simply as ‘a file’ on a hard disk, lose their potency.

Certainly what we don't need is more meta-data attributes on files.

IMO one should either

1. Create a simple file format that bundles the contents of a post. For example a zip file with the media, comments, and likes and such, or

2. Have the post (for example as JSON) and media in separate files and store the references to the other files in the post, sort of like in HTML. Perhaps have the computer system be able to extract such references and let you easily operate on files that "belong together".
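A rough sketch of option 1 in Python, using only the standard library. The archive layout (a `post.json` manifest next to the media file) and the function names are my own assumptions, not any existing format; the manifest also shows the "store references to the other files" idea from option 2.

```python
import io
import json
import zipfile


def bundle_post(media_name: str, media_bytes: bytes, comments: list, likes: list) -> bytes:
    """Pack a media file plus its comments and likes into one zip archive."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr(media_name, media_bytes)
        # Hypothetical manifest: it refers to the media by file name.
        manifest = {"media": media_name, "comments": comments, "likes": likes}
        zf.writestr("post.json", json.dumps(manifest, indent=2))
    return buf.getvalue()


def read_post(bundle: bytes) -> tuple:
    """Return (manifest, media_bytes) from a bundle made by bundle_post."""
    with zipfile.ZipFile(io.BytesIO(bundle)) as zf:
        manifest = json.loads(zf.read("post.json"))
        media = zf.read(manifest["media"])  # follow the reference
    return manifest, media
```

The result is an ordinary file: it can be copied, e-mailed or backed up like any other, and a viewer that understands the manifest can render the post with its comments and likes again later.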

They even mentioned databases and relationships earlier, and grouping files together in different ways.

---

> This bundle, this new ‘file’ type, is not merely a complex data type; the important thing from the users’ point of view is that it is a mirror of the social life that the file enables.

I have no idea what they are trying to say here.

If you create a file format like the one I described, containing all of the data that made up the original post, then you can present it again at a later point, and you can choose to render it just like Facebook would. Surely that's exactly what the users want?

---

> However, this immediately raises complexities. For instance, images posted to Facebook might be copied not only by the person who posted them, but also by others. In these circumstances, should these others be able to copy the metadata, the tags as well as the thing-itself? If so, what of the rights of the owner or, if you prefer, the maker of the initial file? When people copy an originating file, would they be creating a new file or would their new entity be a version of the original one? Is there an order of precedence that we are proposing and ought this to be reflected in the concept of a file that might apply?

Bits don't have color. http://ansuz.sooke.bc.ca/entry/23

Trying to accurately track origin of a post is going to lead to nothing but trouble.

Don't try to build a technical solution for something that is not a technical problem. If someone breaks copyright laws you take them to court and sort it out there.

> It seems to us that there is a distinction that ought to be made between things that are put on the web, which the originator wants to have file-like properties (even as that thing develops a social life once on the web), and those things that are posted that the user does not want to have file-like properties. The properties we are thinking of have to do with questions like whether ‘making a copy’ means making a copy, a version of the thing itself, or having and owning (as it were) the originating thing itself and all that has ensued in that thing’s social life.

WHAT??? Just, WHATT??? Are they purposely trying to ruin the internet? It's not up to one person to decide in which manner others copy something or not. Once again, if someone is doing something illegal, take them to court. And if they're not doing something illegal, don't try to restrict what other people are trying to do.

Fuck this. No, really. I'm done reading that paper.


When I first started using computers, I did so on Microsoft Windows 95. The first applications that I used were Netscape Navigator and MS Paint, as well as a few games. Being a child when I was introduced to computers, I did not have any notion about what was going on inside of the computer at all. All I knew was that there was a screen, a mouse and a keyboard, and that I could click on things on the screen and I could type on the keyboard.

The first time I was confronted with the notion of a file was certainly when I had painted something in MS Paint and I had clicked the save button. I think I had been told to not click "My Computer", which makes sense -- you don't want a child to accidentally move, delete or rename files on your computer. Hence I had no notion of the file system. All that was known to me was the desktop, the start menu, and a select few programs accessible through either icons on the desktop or in the start menu.

I had played a couple of games on the computer, and in those I could save the game and then the next time I could resume the game someplace near to where I had last been -- or rather, I could have my father help me resume the game. So when I managed to save the painting I sort of expected that it would just show up on screen the next time I started MS Paint. When it didn't, I was befuddled for a moment, but I just concluded that I didn't understand what had happened and didn't give it much more thought. I think this is pretty typical of how most children treat situations that confuse them.

This is user level 0. You are able to move the cursor and to type a little bit on the keyboard and to run some specific programs, but that's it.

During the next few years I learned how to work with files in MS Paint and other specific applications.

However, not all files are equal. If I tried to open a file that was made in one application with another application, it would often result either in an error message or in garbage on the screen.

This TIED my notion of a file to specific programs, and to the content that is shown on screen, for the LONGEST time. It is a bit difficult to explain what I mean here, but I think that to the majority of the population as a whole, this is what a file is. They view a file as an icon that you can open in a SPECIFIC program. And they call that file a "<name of program> file", or they call it by the extension, but they have no idea, or they have the wrong idea, about what the contents of the file are. If you give a regular Windows user two files which both have the extension .dat (a commonly used generic file extension for data), then they will think that those two files must necessarily be of the same kind "somehow" and that both of them are to be opened with some specific, unknown program. This is bad and harmful in my opinion.
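To illustrate why the extension says little about the contents, here is a small Python sketch that guesses the kind of data from its leading bytes instead of its name. The signature table covers only a handful of well-known formats and the function name `sniff` is just something I picked for the example; real tools check far more.

```python
# A few well-known leading byte signatures ("magic numbers").
MAGIC = [
    (b"\x89PNG\r\n\x1a\n", "png image"),
    (b"%PDF-", "pdf document"),
    (b"PK\x03\x04", "zip archive"),
    (b"GIF87a", "gif image"),
    (b"GIF89a", "gif image"),
]


def sniff(data: bytes) -> str:
    """Guess what a blob of bytes is from its first few bytes."""
    for prefix, kind in MAGIC:
        if data.startswith(prefix):
            return kind
    return "unknown"


# Two hypothetical "foo.dat" files can hold completely different things:
first_dat = b"\x89PNG\r\n\x1a\n" + b"\x00" * 8
second_dat = b"PK\x03\x04" + b"\x00" * 8
kinds = (sniff(first_dat), sniff(second_dat))
```

Here `kinds` ends up as a PNG image and a zip archive even though both blobs would carry the same .dat name on disk.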

Likewise, I was quite confused for the longest time about "My Documents" and "My Pictures". I was confused by why things were being put in "My Documents" by default when those things were not things that I considered to be "documents".

Furthermore it was quite mystical to me for a long time how "My Computer" could be on the desktop at the same time as my desktop was under a folder that was within "My Computer" itself. This however is not a huge deal. Just yet another thing that didn't make sense to me while I was trapped with the graphical representation of the system.

The paper is arguing a point of view that a different abstraction should be used than the hierarchical file system. I agree to some extent but not for the same reason perhaps.

I think that the desktop metaphor is inherently harmful as a first introduction to computing. The desktop is fine ONCE you've understood how the system works from a bit of a different point of view (though perhaps NOT necessarily a lower level as such), but until then the desktop metaphor will trick you into believing very many things that are simply not true, and which are going to come back and bite you in ways like those mentioned in the paper.

As for the doing away with the hierarchical file system, I agree. Throw it out. I enjoy Unix, but I don't hold the hierarchical file system particularly dear. In fact, I think Unix has some very powerful ideas, and it sucks a lot less than Windows, but Unix is just a local optimum and nothing more.

---

Finally, on a bit of a different note, I'd like to state how I tend to think of files now.

To me, a file is data. Often that data will have been structured in a particular way, and sometimes it will have been structured in no particular way. A valid python program has a structure that allows the python interpreter to execute it. A text file that someone wrote using a plain text editor has a certain encoding but no structure beyond that. A file that was created by putting random data into it will not have any structure. A file that was corrupted will have some data that does not conform to the intended structure.

I am aware that the order of the bytes in persistent storage, and whether they form one single, continuous unit there, might not match what is presented to applications by the operating system, but in my use of computers this has not yet mattered, so I choose to ignore that fact. So I too am living a bit of a "lie" with regard to how I think about files, I admit that.

Some files are not really files, but they are convenient because they offer you a simple interface to some useful functionality. I am speaking of course of /dev/urandom and friends.
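For example, on a Unix-like system you can get random bytes with nothing but the ordinary open/read calls. A minimal Python sketch follows; the function name is made up, and the fallback to `os.urandom` for systems without the device node is my own addition so that the example runs anywhere.

```python
import os


def read_random(n: int, device: str = "/dev/urandom") -> bytes:
    """Read n random bytes using plain file operations on a device node."""
    if not os.path.exists(device):
        # Assumption: on systems without /dev/urandom, ask the OS directly.
        return os.urandom(n)
    with open(device, "rb") as f:
        return f.read(n)
```

Nothing here is specific to randomness: the same open/read pattern is how you would read a regular file, which is exactly the convenience being described.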

Regular files are representations of "something". Like a text, or a photo, or anything else that you can create a meaningful representation of. You can load the data into a program as long as that program has been programmed to understand the structure that is used in the file that stores your representation of your data. If the program that you would like to use does not understand that file format, you convert the file to some other format, provided an acceptable conversion is possible with an existing piece of software. If no such software exists, you implement the conversion yourself, either in the program that you are using (because it's open source, as is the vast majority of the rest of your software) or as a standalone program that does just the conversion. In both cases you are able to do this because the file format is open and sufficiently simple, or at least the subset of the format that you need is. If you are using proprietary software (including using multiple pieces of proprietary software) in combination with proprietary file formats, then either

1. The proprietary piece(s) of software is/are able to do absolutely everything that you need to do (or at least, you think so), or

2. The data is not sufficiently important to you to warrant building better tools yourself, or

3. It's so difficult to build these tools that you don't have an alternative. For example, the data might come from very complicated equipment that you couldn't build yourself even if given a million years to do so, and the data is so complex that you aren't able to understand it from inspection nor from reverse engineering, or

4. The requirements changed (see also point one about thinking that the software did everything that you needed it to do), or

5. You've done fucked up. You didn't do your research and now you're stuck with this. (See also point one about thinking that the software did everything that you needed it to do.)

Anyway, once you've loaded your data into a piece of software that is able to read it, something happens, and it brings us back to what we were talking about.

---

Once you've loaded your data into a piece of software that is able to read it, or "opened a file" as is often the way that this is achieved, I would argue that you are not operating on a file. You are working on the data that was in the file. This is a very important distinction.

When you hit save, you are not saving the file. You are requesting that parts of the application state be written to persistent storage. Again, a very important distinction.
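A toy Python sketch of that distinction. The `EditorState` class and its fields are invented for the example: the program holds more state than it persists, "save" writes only a chosen part of it to storage, and "open" builds fresh in-memory state from the stored bytes.

```python
import json
import os
import tempfile
from dataclasses import dataclass, field


@dataclass
class EditorState:
    """Hypothetical in-memory state of a tiny text editor."""
    text: str = ""
    cursor: int = 0  # transient: never written to disk
    undo_stack: list = field(default_factory=list)  # transient as well

    def save(self, path: str) -> None:
        # Only part of the application state is persisted.
        with open(path, "w", encoding="utf-8") as f:
            json.dump({"text": self.text}, f)

    @classmethod
    def load(cls, path: str) -> "EditorState":
        # "Opening a file" creates new state from the stored data.
        with open(path, encoding="utf-8") as f:
            return cls(text=json.load(f)["text"])


def roundtrip(state: EditorState) -> EditorState:
    """Save the state to a temporary file and load it back."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "document.json")
        state.save(path)
        return EditorState.load(path)
```

After a round trip the text survives, while the cursor position and undo history are back at their defaults, because they were never part of what was stored.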

If people knew this and understood this, computers would make more sense to them. They are right in the paper that the mismatch between what developers, computer scientists and others think of as files is the cause of a lot of the kinds of problems that people have when they are dealing with files.

But where did the users gain their misinformed ideas about files from? I've said it already, I think the desktop metaphor is to blame. Again, it's an ok metaphor as long as it's not the only metaphor.


October 1, 2011


Thanks, added.


"We suggest that one aspect of this adaptation is to encompass metadata within a file abstraction"

I thought File Streams would be the ideal place for metadata, except that, as I read elsewhere, the metadata isn't transferred when you export to the cloud. And didn't Microsoft try and fail at this with WinFS? I guess it's easier to write papers than to actually build the thing. No doubt whoever does actually succeed in writing one will have to pay Microsoft a patent license.



