The structure of the special type of DOS files called directories (1998) (tripod.com)
91 points by luu 29 days ago | 68 comments

> Archive bit is somewhat symbolic. It should be set if the file was not archived by the backup utility. Never in my life I have seen the use of this bit.

I have - back in the days of DOS' BACKUP.COM and RESTORE.COM utilities.

I believe the way it worked was that when you created or edited a file, DOS set this bit, and when BACKUP.COM archived it, that cleared the bit again (or was it the other way round?). That allowed a very early form of incremental/differential backup, and there was a whole section explaining it in the DOS 3.somethingorother manual from which I first learnt very basic system administration.

Back in those days our school PC came like many others with two floppy drives called A: and B: - which is why hard disk names traditionally start at C: - and you backed stuff up with a complicated schedule of put system disk in A:, boot, put data disk in B:, load program then it tells you to swap disks so you take out the system disk and put the backup disk in A: and when you're done you put the system disk back in again.

You could set and clear the archive bit manually with ATTRIB +/-A FILENAME.
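For anyone who wants the mechanism spelled out, here is a minimal Python sketch of an archive-bit incremental backup. The value 0x20 is the archive flag in the FAT attribute byte; the `files` dictionary and both function names are made up for illustration, and this only models the bookkeeping, not what BACKUP.COM actually did internally.

```python
ATTR_ARCHIVE = 0x20  # archive flag in the FAT attribute byte


def touch(files, name):
    """Model DOS creating or modifying a file: the archive bit gets set."""
    files[name] = files.get(name, 0) | ATTR_ARCHIVE


def incremental_backup(files):
    """Return the names that need backing up and clear their archive bit."""
    pending = [name for name, attr in files.items() if attr & ATTR_ARCHIVE]
    for name in pending:
        files[name] &= ~ATTR_ARCHIVE & 0xFF
    return pending
```

Running `incremental_backup` twice in a row returns an empty list the second time, which is exactly the property the bit was meant to provide.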

Linux/UNIX has this too, through mtime (a time not a bit)!

Plus it's idempotent, unlike the archive bit! (e.g. if you lost your last backup, you have a time, not a cleared bit, so you can redo it)
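A rough sketch of the mtime approach, assuming you keep the timestamp of the last successful backup somewhere yourself (the function name and the `last_backup_time` parameter are hypothetical):

```python
import os


def changed_since(paths, last_backup_time):
    """Return the paths whose modification time is newer than the last backup."""
    return [p for p in paths if os.stat(p).st_mtime > last_backup_time]
```

Because nothing on the filesystem is modified by the selection, calling it again with the same timestamp picks the same files, which is the idempotence mentioned above.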

The most complex operation due to some klugey MS stuff is 7 pseudocode steps... if that's hellish, then engineering is not a good career choice.

"What can go wrong with long filenames? Give some space to your imagination..."

They can get out of sync, and then they're going to float around until a long-filename aware system clears them out. Shockingly, systems that aren't aware of a new standard aren't aware of the new standard. ¯\_(ツ)_/¯

I've been working on a language for handling data structures much like this, and have some preliminary notes on a strategy for handling this scenario: https://tenet-lang.org/versioning.html

Yeah. You can give Microsoft a lot of crap, but what never ceases to amaze me is how inventive they were in keeping everything as compatible as possible, which they usually (successfully, mostly) pursued at all cost. Of course, that came with a high price, in terms of complexity and kludginess.

It's pretty interesting because in many cases the software evolved faster than anyone could keep up with, and there are large portions of backwards compatibility that I suspect were mostly accomplished through trial and error, especially when it comes to their layout algorithms.

When they opened up the Office file formats, they wrote an XML spec that corresponds to the fields in the various binary formats. And there are sections of it where the specification is simply, "this should be formatted the way MS Word 5.1 does it."

And then they released WindowsME, with broken memory management.

This is not hell.

I have had mercifully little exposure to this world, but I once had to implement ACLs for a storage system my company built. We stored ACLs, and needed to make them available through Linux and Windows interfaces.

Linux was easy.

Windows (ca. 2006) was a fucking nightmare that puts this document to shame. From what I could tell, there was no concept of an API. There were data structures, and the interpretation of them depended on many things, including minor OS version.

For all the quirks and facepalms to be found in the FAT logic, I have to give credit to the staying power of FAT, and its universal compatibility. Its relative simplicity makes it perfectly adequate as the default filesystem for short-term storage (e.g. USB flash drives) where things like resilience and journaling are not an absolute necessity.

Its 4GB max file size was a pretty big problem near its end of life.

Apart from media files, where does that limit really hurt?

Backups, or more generally tarballs, zips, etc. There have been way too many times I've had to split zips or tarballs just to copy stuff over a FAT-formatted USB disk. Of course I no longer do that, but there was a brief period when this was a problem.
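The limit comes from the directory entry storing the file size in a 32-bit field, so the largest possible file is 2^32 - 1 bytes. A small sketch of the arithmetic for splitting an archive (the function name is made up):

```python
FAT_MAX_FILE_SIZE = 2**32 - 1  # the directory entry stores the size in 32 bits


def parts_needed(total_bytes, part_size=FAT_MAX_FILE_SIZE):
    """Number of pieces an archive must be split into to respect the size limit."""
    if total_bytes <= 0:
        return 0
    return -(-total_bytes // part_size)  # ceiling division
```

A 10 GiB tarball, for example, needs three pieces.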

I remember back in Commodore days, ARC or LHARC, or one of the archiving programs had the ability to span 1541 disks. But I can't remember why that was necessary. Under what circumstances was I moving files around that were 3x the size of the computer's memory?

Good question. Even for programs and games that spanned multiple disks, you’d explicitly arrange the data so that it makes sense (sometimes even duplicating some of it), and not just uniformly spread it across all the disks.

I can’t remember a single instance where there was the concept of a file that spanned multiple disks on the C64. In theory, the hard drive add-ons might have led to the use of such files (they don’t have to fit into memory at once, you can just seek within the file), but given their relatively low popularity, that was probably very rare, if there was any support for such files in software at all...

Wish there were a FUSE for windows... I mostly format external drives as NTFS as it's easier to get support on mac and linux than anything else in windows.

Scientific datasets, for one.

> universal compatibility

Afaik all operating systems read and write UDF formatted sticks and UDF is quite simple as well, and has none of the limitations FAT has. Also no patents.

I had never heard of this, and you are indeed right!


Frustratingly, neither Windows nor Mac exposes the options to format a disk in this way in their GUIs, but it can be done via the CLI in both cases!

> The entry with the name consisting of exactly one dot is the pointer to the root directory

Is it? Wasn't the single-dot entry a pointer to the current directory since DOS 1.40?

If only exFAT wasn't patent-encumbered :( exfat-fuse works ok, but I'd love there to be a universal filesystem for external devices.

"If the first character has the code 05, then actually the first character has the code E5 and it is not a special character. If the first character has the code E5, then the file was deleted"

I'm struggling to interpret this, why would the code be changed to E5?

I assume a file starting with 05 isn't deleted.

Choice of E5 for a deleted entry:

#1 E5 is a "sync byte" it cannot be rotated and mis-interpreted: E5 E5 E5 ... can be used to synchronize a bitstream. Note that floppy discs are read bit-wise.

#2 Empty 8" floppy discs came pre-formatted with E5 written everywhere.

#3 05 is a "control-e" in "extended 8 bit ASCII (IBM encoding)". E5 was a usable character.

With CP/M, the disc bitmap was produced when the disc was "loaded". All directory entries were scanned, and the allocation bitmap was produced. A freshly formatted disc would have E5 fill, and thus would have no files. MS-DOS had a separate allocation table, which was also the file linkage table. So, the strategy of a fresh E5 filled disc being taken as empty no longer works. But, the key of a deleted entry having E5 as the initial character was still used.

Hope this helps.
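The CP/M scan described above can be modelled roughly like this. It is a simplified sketch, assuming the CP/M 2.2 layout of 32-byte directory entries where byte 0 is the user number (E5 marks an unused entry) and bytes 16 to 31 hold up to sixteen 8-bit block numbers, with 0 meaning an unused slot. The function name is made up.

```python
ENTRY_SIZE = 32
UNUSED = 0xE5


def build_allocation_bitmap(directory, total_blocks):
    """Scan every directory entry and mark the blocks that live files occupy."""
    used = [False] * total_blocks
    for offset in range(0, len(directory), ENTRY_SIZE):
        entry = directory[offset:offset + ENTRY_SIZE]
        if len(entry) < ENTRY_SIZE or entry[0] == UNUSED:
            continue  # on a freshly formatted, E5-filled disc every entry is skipped
        for block in entry[16:32]:
            if 0 < block < total_blocks:
                used[block] = True
    return used
```

With a directory that is nothing but E5 bytes, the result is an all-free bitmap, which is why a blank disc needed no extra initialisation.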

Thanks for that. I hadn't come across the idea of a sync byte. For anyone else, the binary representation of E5 is 11100101. No matter where you start reading, you're always going to know whether you are reading from the start of the byte or not. Contrast that with null (00000000), where it's impossible to know where you've started from.
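A quick way to check the claim: rotate the byte through all eight bit offsets and see whether every offset gives a different pattern. This is only a sketch of that one property (both function names are made up); it says nothing about how a real floppy controller encodes or locks onto the bitstream.

```python
def rotations(byte):
    """All eight left-rotations of an 8-bit value."""
    return [((byte << n) | (byte >> (8 - n))) & 0xFF for n in range(8)]


def is_self_synchronizing(byte):
    """True if every bit offset yields a different pattern."""
    return len(set(rotations(byte))) == 8
```

`is_self_synchronizing(0xE5)` is true, while 0x00 (and, for instance, 0xAA) fail the test because some shifted view looks identical to the original.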

This has a bit of info ftp://ftp.apple.asimov.net/pub/apple_II/documentation/misc/disk_encoding.doc.txt

E5 is an extended ascii code, so I think 05 is acting as an escape character so that the filename won't be interpreted as deleted in the case the filename actually starts with å.
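Put together, the rules for the first byte of a directory entry look something like this. A minimal sketch; the function name and the returned labels are made up, and the 0x00 case (entry never used, nothing follows) comes from the FAT layout rather than from the quote above.

```python
def classify_first_byte(entry):
    """Interpret byte 0 of a 32-byte FAT directory entry."""
    first = entry[0]
    if first == 0x00:
        return ("free", None)      # never used, and no entries follow
    if first == 0xE5:
        return ("deleted", None)
    if first == 0x05:
        return ("in use", 0xE5)    # escaped: the real first character is E5
    return ("in use", first)
```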

“Note that filename cannot consist solely of spaces, but extension can.”

This is the sort of thing that makes me wonder what on earth went wrong or even could have been right about Microsoft. Why would spaces be permitted in a file extension at all, ever?

They are not. The article mentions before that sentence that the spaces are used for padding. In DOS, filenames are fixed 8+3 structures, and unused trailing positions are padded out using spaces. You cannot, for example, have a space at the beginning of the extension, nor is the space at the end considered part of the extension at a higher level.

So it’s really just an internal representation detail. You might argue that padding with 0 would be better (if you go for fixed size at all), and nowadays that’s what you’d probably use, but it’s kind of arbitrary anyway, and since spaces were illegal in file names, they just used that. Today, 0 has the useful property of also acting as the terminator for zero terminated strings, which became popular through C. Back then, that mattered less.

Contrast this with modern file systems, where spaces in extensions are allowed (just like anywhere else in the file name, as the distinction usually does not exist on the fs level anymore).

I don't know about that specific issue, but some of the weirdness in MS-DOS is based on precedents set by CP/M. It's not really compatible but recycled many of the same concepts.

Can't blame the horrible way they implemented long filename support on CP/M.

I was mildly impressed at how they packed the time field into a 16-bit value, losing only odd-numbered seconds in the process.
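For reference, the packing is 5 bits of hours, 6 bits of minutes and 5 bits of seconds divided by two. A small sketch (function names are made up, and no range checking is done):

```python
def pack_time(hours, minutes, seconds):
    """Pack a time of day into the 16-bit FAT format (2-second resolution)."""
    return (hours << 11) | (minutes << 5) | (seconds // 2)


def unpack_time(value):
    """Inverse of pack_time; seconds always come back even."""
    return (value >> 11) & 0x1F, (value >> 5) & 0x3F, (value & 0x1F) * 2
```

So 13:37:59 comes back as 13:37:58 after a round trip.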

It’s horrible, but compatibility often leads to horrible things. Remember that they wanted disks with long file names to work on old systems, and have them be able to manipulate files with long file names as usual (through their short names), in all kinds of manners.

I’m not saying it’s good, but MS at the time had compatibility as one of their highest priority, so that’s what they were going for.

(But the parent’s comment is about the initial implementation, not long file names.)

I'm saying that even taking reverse compatibility into mind (which they didn't even have, FAT32 couldn't be read by a FAT16 driver) the way the filenames are encoded is a mess with the name being split up by attribute and checksum bytes.

One has the impression that it looks like it does because they didn't want to touch most of the FAT16 driver code and instead tried to make touchups around the edges.
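To illustrate the "split up" complaint: one long-filename entry holds 13 UTF-16 characters in three separate runs, with the attribute byte (0x0F), a type byte, the checksum and a zeroed cluster field sitting between them. The sketch below assumes the commonly documented offsets (1-10, 14-25, 28-31, checksum at 13); the function names are made up.

```python
ATTR_LONG_NAME = 0x0F


def lfn_fragment(entry):
    """Extract the up-to-13 characters stored in one 32-byte LFN entry."""
    if entry[11] != ATTR_LONG_NAME:
        raise ValueError("not a long-filename entry")
    raw = entry[1:11] + entry[14:26] + entry[28:32]  # 5 + 6 + 2 characters
    text = raw.decode("utf-16-le")
    return text.split("\x00", 1)[0]  # the name ends at a NUL, the rest is filler


def short_name_checksum(name11):
    """Checksum of the 11-byte short name, as stored at offset 13 of each LFN entry."""
    total = 0
    for b in name11:
        # rotate right by one bit, then add the next byte, keeping 8 bits
        total = (((total & 1) << 7) + (total >> 1) + b) & 0xFF
    return total
```

The checksum is what lets a long-name-aware system notice that the short entry was changed behind its back by an older system.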

Not sure what you mean. They did it precisely for compatibility. It worked for FAT16 (and probably FAT12, for that matter) as well, and in fact long file names came before FAT32. Anything that would trip up older DOS versions or any other FAT consumers was a no-go.

IIRC Long filename support was introduced with Windows 95, which shipped with FAT16. It wasn’t until OSR2 that they introduced FAT32.

In theory FAT32 could have introduced a new on-disk data structure but I guess they wanted the minimum required changes to support larger disks.

You could boot into pure DOS in win9x even on fat32, and pure DOS does not support long file names. Perhaps they kept that hack to be compatible with the fat32 "light" implementation of the DOS that shipped with win9x.

Because "empty"/unused characters of the name (including the extension) shall be filled with the space character.

As the names are fixed to 8.3 there is no such thing as length detection/indication for the string.

That’s a description of part of the problem, not a reasonable excuse.

Didn't lots of SQL databases' CHAR(n) type used to be space-padded instead of null-padded too? If I had to guess, DOS simply copied that idea: filename=CHAR(8), ext=CHAR(3) with three blanks the "NULL" value.

> Didn't lots of SQL databases' CHAR(n) type used to be space-padded instead of null-padded too? If I had to guess, DOS simply copied that idea

DOS couldn't have copied what lots of SQL databases did, since DOS was being written around the same time as the first SQL databases.

My guess would be that, like the three character extension itself, this came from CP/M, not SQL.

So you can have files with "no extension"? The whole land of DOS is about fitting into stupid constraints of the era's HW.

But if there were "no extension", that would be a 0-length extension, rather than a 3 character extension with 3 spaces.

Even if it has to be 3 characters, shouldn't they at least be \0?

Since spaces are in fact not allowed, choosing between 0 and 0x20 is pretty arbitrary here. You’d probably go for 0 today because C became popular and 0-padding has the useful property of also 0-terminating the string, but at the time that was less of a consideration.

I assume any extension that consists solely of any combination of \x00 and \x20 is to be understood as "no extension"

In what world are files required to have an extension?

They weren't required to, but the convention for the shell was, unlike unix with its X bit:

extension COM: just load the file into memory and start executing from the very beginning (I think the max size of these might have been 64KB or something).

extension EXE: this is a "full" executable with different segments - load it properly

extension BAT: this is a "shell" script, run it with the interpreter

The advantages of COM over EXE were slightly smaller file size on disk and slightly faster startup as you didn't have to set up segments and stuff first. Some versions of DOS even had a program called EXE2BIN that converted an EXE file into a COM file, if this was possible. The joys of the 16-bit "tiny" memory model ...
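As a toy model of that convention, assuming the interpreter tries COM, then EXE, then BAT when you type a bare command name (that order is how I remember COMMAND.COM behaving; the names and the handling descriptions are made up for illustration):

```python
SEARCH_ORDER = ("COM", "EXE", "BAT")

HANDLING = {
    "COM": "load the raw image into memory and jump to its start",
    "EXE": "read the header, set up the segments, then start",
    "BAT": "feed the lines to the command interpreter",
}


def resolve_command(name, directory):
    """Pick the file a bare command name would run, trying COM, EXE, then BAT."""
    for ext in SEARCH_ORDER:
        candidate = f"{name.upper()}.{ext}"
        if candidate in directory:
            return candidate, HANDLING[ext]
    return None
```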

The space for the name is pre-booked, so in a sense all files always have a full 8-char name and 3-char extension, whether they like it or not.

If you want to support <8-char names, <3-char extensions, or no extension at all, you need to decide on values that indicate this, because those bytes will still need filling with something.

(On 6502-based systems, 0 would be a reasonable choice for a padding value, because you get a free zero check every time you read a byte. The 8086 doesn't work like this... I don't think any particular value suggests itself. Perhaps they thought picking a printable character would be useful.)

Picking the space character, apart from being I think an established convention already (the CHAR(n) type in databases), definitely has the advantage that you can just "blit" the filename onto the video buffer in text mode when doing a DIR command. You don't have to bother with checking its length or anything, just copy 11 bytes and insert a space (not a dot in the original DIR listing format) after the 8th one. If the file doesn't actually have an extension, this just prints the right number of spaces so the other columns are aligned.

EDIT: see https://en.wikipedia.org/wiki/Dir_(command)#/media/File:Spar... for an example.
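In Python terms the "blit" amounts to this (a sketch; the function name is made up):

```python
def dir_columns(raw_name):
    """Format an 11-byte on-disk name the way the original DIR listing did."""
    if len(raw_name) != 11:
        raise ValueError("expected 8 name bytes plus 3 extension bytes")
    text = raw_name.decode("ascii")
    return text[:8] + " " + text[8:]
```

Every result is exactly 12 characters wide, so the columns that follow line up without any length checks.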

Why even treat the extension as something separate from the filename?

The table at the beginning of the article should answer that: to save a byte. The dot is not saved, FILENAME.TXT is saved as FILENAMETXT on disk.
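A minimal sketch of that conversion and its inverse, assuming plain ASCII names and ignoring the rules about illegal characters (both function names are made up):

```python
def to_disk_name(filename):
    """Convert 'NAME.EXT' to the 11-byte, space-padded on-disk form."""
    base, _, ext = filename.upper().partition(".")
    if not 1 <= len(base) <= 8 or len(ext) > 3:
        raise ValueError("does not fit the 8+3 layout")
    return (base.ljust(8) + ext.ljust(3)).encode("ascii")


def from_disk_name(raw):
    """Inverse: strip the padding and put the dot back if there is an extension."""
    base = raw[:8].decode("ascii").rstrip(" ")
    ext = raw[8:11].decode("ascii").rstrip(" ")
    return f"{base}.{ext}" if ext else base
```

A name without an extension simply ends in three spaces on disk, which is the "extension consisting solely of spaces" case discussed further up.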

I'm not sure if it actually worked that well in DOS, but in Digital systems, which inspired the DOS shell, it was useful for default file name parts. Using analogy to Unix, the current working directory is useful as a default - to avoid having to retype it. Same with the extension, although with a different semantics.

Early DOS stuff was written in straight assembler, so to search for *.FOO you can just:

Read a chunk of directory entries into a buffer.

Set ptr to start of the buffer.

Check the extension at ptr[8].

If extension matches, do a thing.

Set ptr to ptr + 32 and loop.
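The steps above translate almost one to one into Python. This is just a sketch of the idea, not what DOS did; the function name is made up, and it additionally skips deleted entries (first byte E5) and stops at the first never-used entry (first byte 00).

```python
ENTRY_SIZE = 32


def find_by_extension(buffer, ext):
    """Yield the 11-byte names of live entries whose extension field equals ext."""
    wanted = ext.upper().ljust(3).encode("ascii")
    for ptr in range(0, len(buffer) - ENTRY_SIZE + 1, ENTRY_SIZE):
        first = buffer[ptr]
        if first == 0x00:
            break                      # no further entries in use
        if first == 0xE5:
            continue                   # deleted entry
        if buffer[ptr + 8:ptr + 11] == wanted:
            yield bytes(buffer[ptr:ptr + 11])
```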

Since the source code has been released [0], I decided to check it out and sadly it isn't that smart. Searching is implemented in DIR.ASM [1] by reading each entry one by one, matching is done only on the filename (encoded as 11 bytes for FILENAMEEXT) and it only handles "?".

Note that this is for MS-DOS 2.0 which didn't handle "*.FOO" searches on the kernel side (the star expansion was done on the COMMAND.COM side via the MakeFcb [2] procedure). MS-DOS 3.0 introduced the star wildcard on the kernel side, so it might have taken advantage of that... but there isn't source code for MS-DOS 3.0.

[0] https://github.com/Microsoft/MS-DOS/tree/master/v2.0/source

[1] https://github.com/Microsoft/MS-DOS/blob/master/v2.0/source/...

[2] https://github.com/Microsoft/MS-DOS/blob/master/v2.0/source/...

Wouldn’t you have to look how COMMAND.COM implemented * matching then, assuming that the filename buffer has a similar fixed length fields representation?

Also note that DOS 2.0 incurred quite significant changes to fs code, as it was the first version to support directories. Before that, the namespace was flat.

All commands (implemented in the TCODE?.ASM files) call PARSE_FILE_DESCRIPTOR (defined in MISC.ASM) which just calls MakeFcb (defined in FCB.ASM) that does the star parsing. See lines 127-133, CX contains the remainder of characters not parsed, if it is star the rest of the characters are replaced with ?s, so a FOO<star>.E<star> (...HN eats the stars) would become FOO?????.E?? (the subroutine MUSTGETWORD "under" MakeFcb does the actual replacing for each part, once for the filename and once for the extension - see the calls to it and how the CX is set up for the lengths).
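Here is the star expansion and the "?"-only matching as a Python sketch. It mirrors the behaviour described above rather than the assembly itself, and all three function names are made up.

```python
def expand_field(text, width):
    """Pad one field with spaces; a star turns the rest of the field into ?s."""
    out = ""
    for ch in text.upper():
        if ch == "*":
            return out[:width].ljust(width, "?")  # anything after the star is ignored
        out += ch
    return out[:width].ljust(width)


def make_pattern(spec):
    """Expand a wildcard spec into an 11-character pattern with no dot."""
    base, _, ext = spec.partition(".")
    return expand_field(base, 8) + expand_field(ext, 3)


def matches(pattern, name11):
    """Compare position by position; ? matches any character."""
    return all(p == "?" or p == n for p, n in zip(pattern, name11))
```

With this, a star followed by `.TXT` becomes eight question marks plus `TXT`, and any characters typed after a star within the same field are dropped.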

MS-DOS 1.25 does pattern matching mostly the same way [0]. The code is a bit simpler as it doesn't have the crazy macro stuff that MS-DOS 2.0 code has, but if you look a bit around the files, you'll see that it is mostly the same code just moved around and split in several files. The file scan works again in almost the same way, reading each directory entry one by one and doing the pattern matching after the fact.

[0] https://github.com/Microsoft/MS-DOS/blob/master/v1.25/source...

Thanks for the thorough analysis! Very insightful. So the 8+3 structure still simplifies matching: To my recollection at least, in DOS a * would always wildcard out the rest of the filename component (I did not look in the code), so things like FOOxBAR.TXT weren't possible. That way though, the common x.TXT still is.

Read "x" as *, because I could not for the life of me get the star work as a star within a word in this comment.

My memory is hazy, but the way that most DOS programs handled file iteration was to use FindFirst/FindNext, and that used data in the program-segment-prefix (PSP) - which was borrowed from CPM.

Yes, the second link is essentially what "FindFirst" calls. A few lines below is the meat for "FindNext".

Yep, this sounds like the likely answer. Remember that when these structures came up, the “OS” was indeed just a bunch of hand-coded 8086 (and earlier) assembly routines.

For me, having gotten my feet wet with CP/M, I seem to recall it was so there was a cheap means of indexing files by type if required, since there wasn't much in the way of sorted lists built in, and folders weren't really a thing either.

Yep. The extension was the only way a program knew what type of data to expect from a file.

There was no space on a disk for metadata. You couldn't put metadata on a tape. There wasn't enough memory in the computer for the OS to determine a file type by reading a header, and even if there was, some operating systems had their DOS in ROM, so that would mean never adding any more file types (I'm looking at you, CBM PRG, USR, SEQ, REL files!).

Hah, at the time, “USR” files were the biggest mystery to me. I knew PRG, SEQ and REL, but this „custom“ file type seemed very special.

Apple had a file type and nice, long names.

I’ve heard once that the design of FAT was so straightforward that one would be hard pressed to enforce any patents on it. And that Gates designed it on an airplane trip.

Does anyone know if any of this is true?

Don't forget that DOS 1.0 didn't have subdirectories or long file names, so most of the complexity in the article comes from years of shoehorning new functionality into it.

I did not expect tripod.com to still exist.

Seeing a post hosted on tripod.com brings back so many memories. It's my favorite thing about this submission.

Makes me remember creating files by hand using Norton Utilities' diskedit directly on the raw bytes. Those were the days :)

i wanted to comment on that lol

"Welcome to hell."

All i needed to read :)

Well 3 people dislike this i find that funny
