Hopefully this isn't too hard of a rant, but file descriptors are... neither files nor descriptors. They literally don't "describe" anything about anything, and whatever they refer to doesn't need to be a file either. They're literally the exact opposite - 100% opaque integers, as opaque as anything in a program could possibly be, that could refer to pretty much any kernel object.
Why they were ever called file descriptors in the first place (esp. given the one thing they lack is any description), and why they couldn't just start being called "handles" (like on Windows) or "object IDs" or something else that at least makes some modicum of sense, is beyond me.
> file descriptors are... neither files nor descriptors
This is a pretty modern and userspace-centric view of file descriptors IMHO. The "description" part is in the file description table on the kernel side of the userspace/kernel boundary (see struct file[1]) for what are extremely obvious reasons.
Up to V7 UNIX (1973-1979 [2,3]), the file description table could literally only reference a file on disk; UNIX domain/TCP/UDP sockets weren't introduced until 4.2BSD.
Your gentle rant makes sense from a modern point of view of course, but we need to keep in mind that the UNIX design is over 50 years old now. You could argue that the 4.2BSD people should have used different tables/names rather than overload the file description table but that ship has sailed and here we are.
Looking at [2], they did not. It only has FWRITE and FREAD; later on, in the linked V7, they added FPIPE. I'm not sure how TTYs were handled back then, but I bet it was a device driver of some kind, given the really basic BIOS those systems had.
> Why they were ever called file descriptors in the first place
Human laziness, no doubt. What you refer to as "file descriptor" is really a pointer to a file descriptor. It is likely that over time "file descriptor pointer" or "file descriptor handle" was shortened, and then ultimately accepted into the lexicon.
They represent resources managed by the kernel, exposed to userspace via an integer handle. The already established convention for opening and closing files fits that use case. You can duplicate and pass that handle to child processes, and the resource itself is reference-counted for free. They are "abstract files" with specific behaviours, if you will.
Originally, a file in computing referred to a group of punch cards stored in a box. If this is the definition you hold on to, then signalfd and pidfd do not represent a file. In fact, by that definition nothing in modern computing is a file.
However, as the technology evolved and punch cards went out of style, everyone (except maybe you) accepted "file" as an abstract representation of said box. signalfd and pidfd fit within that abstraction.
Should I preemptively start into your next question of why Linux has tty when there is no teletype to be found?
I don't think using the term "file" for any kernel-managed resource is common outside of Unix terminology. It is certainly not the terminology of Windows. It is not even really true on modern Linux either, nor was it ever fully true in Unix land - Plan9 is probably the only OS that really went for "everything is a file" to the utmost degree.
For example, in either Unix or Linux, a socket is not really a file, even though an open socket has an associated fd. But you can't open a new connection by using open()/read()/write() on any Unix or Linux system (you can in Plan9). So these are not really files. Similarly, most devices expose special ioctl() codes to perform operations on them and barely interact with read() or write().
> Plan9 is probably the only OS that really went for "everything is a file" to the utmost degree.
Plan9 took the "everything is a file" metaphor and extended it to "everything is a file in the filesystem". A file doesn't necessarily have to be found in a filesystem, even if that does add a level of convenience. If we look back to what a file is, a system to keep track of all of your punch card boxes is clearly useful, but if you scatter those boxes around the office haphazardly they are still files.
Outside of Unix land, a Process or a Socket or a Mutex or an Event or a [...] are not considered files. The Windows APIs for example have no concept of "a file" as a generic thing. In Windows, you could say that File, Mutex, Socket, Process, Thread, Event and many others are all "Objects" referenced by "Handles".
It's also notable that the term "file" most likely derives more from the concept of a paper file as the place where one stores information rather than the specific punch card files you mention. The expression "[something is] on file" to mean that "some entity is storing this information" predates computers and is very likely related to use of the term "file" for "something that is stored [by a computer system]".
Regardless, using the word "file" to refer to a process or a TCP connection (in the form of a Berkeley socket) is entirely specific to Unix and systems inspired by it, and is seen as a key part of the Unix philosophy (via the expression "everything is a file").
> It's also notable that the term "file" most likely derives more from the concept of a paper file
Yes, that's right. The dictionary defines file as "a folder or box for holding loose papers that are typically arranged in a particular order for easy reference." Which is literally what a box of punchcards is – hence how the word made its way into computing.
Obviously modern computers do not actually use files in any way. All we have today is a loose abstraction that pretends to represent a file, and within that abstraction Unix-like systems place processes and TCP connections. These are all files in the abstracted sense. Other systems may choose to do things differently. Again, modern computers don't have actual files, just a pretending that they do.
For better or worse, we have come to agree that abstraction can also be named "file", even if it is not actually a box full of paper. Which is true of a lot of things in computing. It turns out that floppy disk button you often see doesn't have anything to do with floppy disks. Hard to believe, I know.
My point is that the way this made it to computing is not so much the specific "file of punch cards", but through a more abstract route: "a physical file is a way to store information in an organization" -> "anything that stores data in a system in general can be called a file" -> "data stored on a computer system is called a file". We used the expression "to have on file" / "to put on file" to mean "to have stored in some way, not necessarily in a physical file" even before computing was a thing.
So, "a file" in the figurative sense means "something that is stored" - whether that's a physical file in a file cabinet or someone's memory, or an entry in a ledger etc. This matches the term "file" as it is used in Windows and in most other operating systems, and the way it was used in the earliest Unix systems. However, it doesn't actually match the way it is used in modern Unix systems - a process or an event or a mutex is not "something that is stored". So, calling these things "files" is unusual and doesn't match the normal figurative meaning of "file", within or without computing.
> My point is that the way this made it to computing is not so much the specific "file of punch cards", but through a more abstract route
That is not what the history books tell. If you have a different story to share, I'm sure the record would love to be updated. I'm rather surprised you haven't already shared your story with us. I guess you don't really have a point.
> So, "a file" in the figurative sense means "something that is stored"
I expect you are thinking of the parallel definition of "file", the one which often goes along with the word "folder". Which is a funny one as in the real world the folder is the file as it is abstracted in that model, but anyway... What we speak of may not be a file in that sense, but as it happens, words can have multiple meanings. In fact, you just spent half your comment telling us that, so there is no turning back now...
> If you have a different story to share, I'm sure the record would love to be updated.
Sure, just looking at the Wikipedia article on "Computer file" [0], we see that the "punched card file" was only an early use of the word in relation to computing.
But they show a slightly later use [1] that matches our modern notion much better, and that uses the exact expression "on file" without referencing punched cards in any way, so quite clearly not in a way derived from the practice of keeping information in punched card files:
> [...] But such speed is valueless unless - with comparable speed - the results of countless computations can be kept "on file" and taken out again [emphasis mine].
> Such a "file" now exists in a "memory" tube, developed at RCA Laboratories.
Wikipedia also shows that "file" was often used in the early days of computing to refer to the storage location, the physical storage device, not the contents in memory. For example, what we'd today call "disks" were called "disk files".
Wikipedia ultimately claims, unfortunately without a citation, that the modern sense of the term "file" came to denote the contents rather than the physical storage once the first "file systems" started being used, and those were managing multiple "virtual files" (that is, what we'd call virtual disks today).
> I expect you are thinking of the parallel definition of "file", the one which often goes along with the word "folder"
Yes, since that is quite obviously the metaphor most lay people have been taught in modern times at least: your data is organized in file folders, each containing multiple files. Basically in this context, "file" is more or less a synonym for "document". This is the most directly applicable non-computing equivalent of the modern computing term "file" (and it, again, doesn't match most things that are "files" in Unix).
And yes, of course words have many different meanings. A file can also mean something that you use on your nails, but that is obviously not relevant to what we're discussing.
Seriously, I'll bite: explain exactly how a signalfd or a timerfd fits the abstraction of a box of punchcards?
Even the FreeBSD man pages (because their authors have an ounce of sense) stopped calling them files. They call them object descriptors or just descriptors, and they're literally "references to objects".
Explain how they don't. Realize it is all pretend. There is no actual box of paper anywhere anymore, just the pretending thereof. The fun thing is, when pretending, you can pretend whatever you want. If you want to pretend that a signal is a file, then it can be a file. There is nothing tangible behind it that ensures a signal can't be a file.
There is a separate computer abstraction that emerged, which you may know by something along the lines of "files and folders". Funnily enough, that one is, indeed, confused. A file, by definition, is what the folder is trying to abstract in that model. But this parallel use of "file" is no doubt why FreeBSD moved to using "object", to try and avoid using the same word to mean different things. But in the end, words are allowed to have more than one meaning, so "file" is equally usable and it just comes down to arbitrary preference.
Simple: because the intersection of all the interfaces of these different "file" types is nothing other than close(). At that point the abstraction has lost all ability to describe anything in common better than just "OS-managed object". When a noun is used as an abstraction, the size of that shared interface is often a good benchmark of its utility, better than just your arbitrary preference.
File in computing may have always been an abstraction, but it started out in relation to some sort of data storage and an appropriate interface, usually also addressable by some index (a filesystem). If you have an object with open/close/read/write/seek, you at least have something. Then we drop seek and add ioctl. Now the cracks have formed, but there is still something there: there is still data involved, and things still have a common source, the filesystem.
But at this point, for these other fds, the only thing in common is close().
You can try to go a different (Plan 9) route, but that requires something to tie these together: a filesystem. And that is not something the systems in question have.
> But in the end, words are allowed to have more than one meaning
But once an abstraction has lost its usefulness, people do not have to go along with the stupidity. I don't even care much about this. Call them files. But it is a bit irritating that people still go on to say "everything is a file" when describing useful attributes of modern UNIX-likes, as if this illustrates some deep insight of this abstraction. This "everything" involves one operation: throwing the thing away. It's stupid.
We should retcon file as a shorthand for "roundfile".
My understanding is that the descriptor is the pointer, the description is the pointee (i.e. the entry in the table). The distinction is sometimes quite relevant, for things like dup(2).
edit: there must actually be a double indirection, from the fd to a table of pointers to file descriptions, as the table is process-private while the description is shared (and multiple fds can refer to the same description).
They describe the state of a stream, which is flags and a current offset at a minimum. On Linux you can examine /proc/<pid>/fdinfo/<fd> to see what it 'describes' precisely.
The file descriptor doesn't describe anything. It's like your social security number, it's merely a unique identifier for something else. Does your SSN describe you as a human? What do I know about you as a human after I hear your SSN? Absolutely nothing.
> What do I know about you as a human after I hear your SSN
You know what part of the country I was born in, and using that, you can guess my age range. The interface is an opaque number, the implementation describes state, some of which is guaranteed to be present, one of which you can access using lseek(), which is a system call. In this case the state is even completely deterministic.
Actually in the modern era SSNs are assigned randomly. Anybody born before about 2011 who hasn't needed a re-assignment will indeed have SSNs which in some sense contain information, but after this date they're just arbitrary numbers (though only a subset are ever issued, no real people's SSNs have "area number" 666 even though these are now random)
While opaque for a subset of the API, it's not fully opaque, since certain assumptions are made by the client based on their value: fd numbers are guaranteed to be reused, and a new descriptor always gets the lowest available number.
What this means is I can do close(0); open(...) to redirect stdin.
> the implementation describes state, some of which is guaranteed to be present, one of which you can access using lseek()
>> Which are flags and a current offset at a minimum.
Uhh... no. Try that with pipes, sockets, FIFOs, eventfds, signalfds, kqueues, or any of the myriad other file types there are. Most open fds on a data center server likely don't meet your criteria.
> Even POSIX disagrees for the concept you are trying to convey cf "file description"
You should have kept reading:
3.258 Open File Description
A record of how a process or group of processes is accessing a file. Each file descriptor refers to exactly one open file description, but an open file description can be referred to by more than one file descriptor. The file offset, file status, and file access modes are attributes of an open file description.
"The file offset, file status, and file access modes are attributes of an open file description"
That says file description there buddy, not descriptor. The two are defined separately, that is all.
The single integer does not describe state in the way those terms are typically used; we have clearer vocabulary: refers to, points to, etc., and that's fine. But it is a distinction with a difference. (They're not even unique.)
> You're attempting to make a huge distinction between interface and implementation. I can't rightly determine why.
Well to start, for the same reason POSIX seems it necessary: they are distinct concepts that warrant distinct treatment.
And I can't determine your insistence on calling different things by the same name. Especially after you yourself called them out as "opaque" yet insist that they describe something. What does opaque even mean?
And this isn't even an interface vs implementation issue. Do you not see why it is useful to a user of the interface to understand that when they do a seek (assuming they can) on fd=1, the offset of fd=2 or 4 or 6 or whatever may change? Two (or more) distinct entities, yet somehow changing some shared state. What do you propose we call that state? The file descriptor? Then what do we call the thing returned by open()? They can't be the same, because I can dup() the thing returned by open() and have a new unique "thing", but it still changes the same state.
> Different integers are always unique. It's the file description that isn't.
The file description, however it is implemented, is a uniquely identifiable entity. An object doesn't lose its identity just because it has multiple aliases, especially when the lifetime of those aliases is ephemeral. The concrete embodiment of this in Linux or BSD would be the kernel address of the struct file.
This whole article is terribly confusing. Take this paragraph for example:
Now, your process might depend on other system resources like input and output; as this event is also a process, it also has a file descriptor, which will be attached to your process in the file descriptor table.
What event? Are input and output an event? Why is this event its own process? Input and output are not a process are they?
Also, does a process have its own file descriptor table? That was never mentioned before and this reads like it is already known.
This sort of stuff goes on in my head throughout the entire article...
It's also still unclear to me what happens if multiple processes try to access the same file. Do file descriptors help to lock files during writing?
The writing style just sucks and it reads like a LinkedIn post with every sentence in its own paragraph. It tries to be approachable, but it uses blurry undefined terms and overly-simplistic analogies.
Starting with 101, 102 in the first example, for some reason.
> When a process or I/O device makes a successful request, the kernel returns a file descriptor to that process
I/O device?
> By default, three types of standard POSIX file descriptors exist in the file descriptor table
Types?
> Apart from them, every other process has its own set of file descriptors, but few of them (except for some daemons) also utilize the above-mentioned file descriptors to handle input, output, and errors for the process.
What?
It gives the impression of a poor translation of a pretty low-effort article, tbh. You're better off just reading the corresponding APUE section, which you must have read anyway.
* File descriptor (for as long as it's open)
* File descriptor number (can be replaced by close+open or dup*); there are also special values like `AT_FDCWD`, and limits like `OPEN_MAX` (not necessarily equal to `FD_SETSIZE` or what `ulimit` limits you to)
* Open file description (unchanged across dup and fork ...)
* Files (e.g. separate open() calls that happen to refer to the same file, whether by the same name or not)
* File names
* File descriptors to special files, to things that are not files, to files outside of the current chroot, to files from a directory that has been mounted over, to files that have been moved or deleted
* What exactly does mmap hold on to?
* recvmsg cmsg can make file descriptors appear. Fortunately this is the most common use of cmsg, but remember you can get more than you request (but IIRC there's no clean way to get the number given, the API is underspecified)
* There's really nothing special about file descriptors 0, 1, and 2; they're just a strong convention that processes tend to have open at fork time. In practice, you can live without stdin and stdout, but stderr can be written to by all sorts of library functions.
> The lsof command is used to list the information about running processes in the system and can also be used to list the file descriptor under a specific PID.
> For that, use the “-d” flag to specify a range of file descriptors, with the “-p” option specifying the PID. To combine this selection, use the “-a” flag.
> $ lsof -a -d 0-2147483647 -p 11472
That works. A nicer way to do it:
htop > F3 (search by name to find the process) > Enter (to select) > L (open the list of file descriptors) > F4 (filter by resource name)
It lists all open files and other resources of the process.
Get the pid of the process you want to inspect, and while its running, execute `ls -lh /proc/<pid>/fd/`. It will list the open file descriptors for that process.
You're welcome! I found that after KDE replaced KSysGuard with System Monitor. The former could list all open files of the process, but the latter can't. So I was digging for some good tool and found this obscure htop feature :)
> For example, if you open a “example_file1.txt” file (which is nothing but a process)
I’m very confused by this use of “process”. This isn’t one? There’s a note further down talking about how closing it will make it available to other processes that doesn’t make sense either.
I think of file descriptors as a void* across address spaces
In C you often use void* for opaque handles.
But you can't have a user space pointer into the kernel, since it's in a different address space. So you instead have an integer that's unique within a particular process, and then a per-process table in the kernel that points to the real data structures.
DOS and Windows call them file handles, but they serve the same function. DOS even has the same 0, 1, 2 for in, out, err. Of course they all inherited this design from Unix.
> Of course they all inherited this design from Unix.
As far as Microsoft is concerned, Windows and MSDOS inherited this design from XENIX, which was Microsoft's clone of V7 UNIX. That XENIX was later sold to the Santa Cruz Operation, then later again SCO XENIX was renamed to SCO UNIX.
It's fascinating to wonder how different the computing world would be today if Microsoft had used their XENIX as the underlying base for Windows, instead of the way they did do it.
Linux would probably have never got off the ground, and remained a curiosity like the MINIX it was based upon initially. And Microsoft would completely own the UNIX world. Thank God they didn't.
No, not a clone: XENIX was licensed UNIX V7, licensed from AT&T. They did not have a license to use the UNIX trademark, so they called their version XENIX.
UNIX was already widely available when Linus started tinkering; he wanted to play with source, and the BSDs were still tangled up in copyright. Had closed-source Windows been based on UNIX, Linux's open-source hegemony would have toppled Windows too.
I'm guessing here based on how they're implemented on Windows, and someone should correct me if I'm wrong, but I would think it's meant to guard against consuming (comparatively expensive) kernel-mode memory. Expensive because they can't be paged to disk [1], and thus actually consume RAM, thus allowing even lowly-privileged applications to completely exhaust physical RAM and thus make the system unusable.
FWIW, I don't think Windows has such a limit, so this makes me wonder if this is really the concern. I'm not aware of a better reason.
[1] AFAIK, on Windows, kernel objects (which the kernel cares about) use the non-paged pool, but handles to them (which user-mode cares about) use the paged pool. So you don't technically need to limit the number of handles per se, but it seems like a convenient proxy.