

A Tale of Two File Names: deconstructing the checksum in Windows 8.3 file names - galapago
https://usn.pw/blog/gen/2015/06/09/filenames/

======
WalterGR

        There are so many layers of abstraction in the Windows API
        that it’s a miracle that anyone could maintain it - which probably
        explains the increasing level of bloat in new versions of Windows. 
    

Are there fewer layers of abstraction in other popular OSes?

~~~
acdha
Yes – it's a product of the way Windows was actually two operating systems,
the Windows 3/95/98/ME lineage and the newer OS/2 / NT lineage.

The NT side had the advantage of being designed at all rather than growing
organically by overworked developers hacking in whatever they needed right
now, and had a number of assumptions (e.g. not starting with the 16-bit API
real-mode model) which avoided some gnarly hacks.

The problem was compatibility: most of the apps had been developed on 16-bit
Windows 3 or, later, Win95 and at the time Microsoft's dominance was far from
a given so they were pathologically afraid of breaking compatibility, which
meant that the Windows “platform” included a lot of weird semi-or-undocumented
corners designed to avoid breaking specific apps and the NT side had to
reimplement bugwards-compatible versions of most of them to avoid breaking
shipped apps.

(This might seem excessive – and I would generally agree – but it's important
to remember that much of the damage was done in the pre-internet era when
shipping updates to software meant putting a box in the mail with a pile of
floppies or, for the really rich people, CDs. Getting someone to upgrade to a
version of an app which didn't rely on an implementation quirk could take many
years.)

Raymond Chen has written about this extensively at
[http://blogs.msdn.com/b/oldnewthing/](http://blogs.msdn.com/b/oldnewthing/)
and one of my favorite examples is the Shell Folders registry key:

[http://blogs.msdn.com/b/oldnewthing/archive/2003/11/03/55532...](http://blogs.msdn.com/b/oldnewthing/archive/2003/11/03/55532.aspx)

The closest you come to this on another mainstream OS is OS X, where they
maintained Carbon (i.e. the supported subset of the classic Mac OS APIs) on
top of the modern core but that was both more limited and was rapidly
deprecated because Apple is far less concerned with breaking backwards
compatibility than Microsoft used to be.

~~~
WalterGR
Thanks. I think that some backwards compatibility can be handled by more code
and not necessarily more layers of abstraction, but I get your point.

It seems like there are lots of layers in Linux distributions because of the
Unix philosophy. For example, to put a window on a screen, aren't the layers
something like this:

App -> window manager -> desktop environment -> X -> graphics driver (->
hardware)

Is that a good approximation? What does that look like on e.g. Windows and Mac
OS?

Is the situation on Linux similar for audio? From my armchair, my impression
is that audio on Linux is/was... complicated.

~~~
KMag
> App -> window manager -> desktop environment -> X -> graphics driver (->
> hardware)

No, the window manager isn't in that chain. You can kill the window manager
and your applications can keep running and drawing to the screen, but you
can't move or resize windows. Also, instead of "Desktop environment" you mean
GUI widget toolkit.

~~~
WalterGR

        Also, instead of "Desktop environment" you mean GUI widget toolkit.
    

By desktop environment I meant e.g. Gnome or KDE. Aren't they separate from
the widget toolkits?

~~~
Dylan16807
Gnome or KDE is just a package including a window manager and a bunch of
mostly-standalone programs that give you a toolbar, file browser, settings
panels, etc. It's not a layer that anything goes through. You don't even have
to use the window manager it comes with.

------
jstanley
Really interesting article, sounds like a fun adventure. Couple of points:

1.) "0x12b9b0a5. This equals 314159269 in decimal. Yep, that’s the first 8
digits of pi right there" -- actually, it's not. The first 8 digits of pi are
3.14159265. It's not clear why the Microsoft developers ended with a 9 instead
of a 5. Perhaps a mistake? Perhaps to help with coprime-ness?

2.) The "ANSI C" version uses mbstowcs which isn't any kind of ANSI C function
I've ever heard of.

Fantastic article, really enjoyed reading, many thanks.

~~~
Quackmatic
> The first 8 digits of pi are 3.14159265

Those are the first 9 digits if you count 3 - the 9 at the end is the 9th
digit, not the 8th, so the first 8 digits are correct. You're correct about
the ANSI C part though, it's actually C99 - I've corrected that part. Thanks.

~~~
jstanley
pi begins: 3.141592653589 - I miscounted 8 instead of 9, but the number is
still wrong.

EDIT: Oh, I see what you mean. He's saying the first 8 digits of that are the
first 8 digits of pi, although the 9th is incorrect. Got it.
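
For anyone who wants to check the arithmetic in this sub-thread, here is a quick Python sketch. The pi digits are typed in by hand from the well-known expansion (3.14159265358979...), and the helper name `common_prefix_len` is made up for illustration; it is not from the article.

```python
# Constant quoted from the article.
CONSTANT = 0x12B9B0A5

# Leading digits of pi with the decimal point dropped (3.14159265358979...).
PI_DIGITS = "314159265358979"


def common_prefix_len(a: str, b: str) -> int:
    """Count how many leading characters two strings share."""
    count = 0
    for x, y in zip(a, b):
        if x != y:
            break
        count += 1
    return count


decimal = str(CONSTANT)
matching = common_prefix_len(decimal, PI_DIGITS)

print(decimal)                                  # the constant in decimal
print(matching)                                 # leading digits shared with pi
print(decimal[matching], PI_DIGITS[matching])   # first pair of digits that differ
```

So both readings hold up: the first 8 digits of the constant match pi, and the 9th is a 9 where pi has a 5.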

------
poizan42
> I’d need to write a device driver to call it, which is both something I’ve
> never touched at all before, _and something that’s nigh impossible without
> access to the DDK or NDK_.

Well the DDK is called WDK now, but anyways how does one not have access to
it? It's right here:
[https://msdn.microsoft.com/en-us/windows/hardware/hh852365](https://msdn.microsoft.com/en-us/windows/hardware/hh852365)

------
rplnt
> and I certainly wouldn’t be able to get access to the NT source code.

I thought that Windows 2000 source leaked. Would seem like an easier path.
Though certainly not better (for either the author or a reader).

~~~
userbinator
What was leaked was most of NT4, and only part of 2k. The extra checksum was
introduced in 2k, but I don't think the leaked source has that part.

------
ape4
Can this checksum routine be contributed to ReactOS? Or is it not allowed
since he disassembled Windows?

~~~
RandomBK
The routine itself would be copyrighted, but someone should be able to make a
clean room reimplementation of the original algorithm, assuming it's not
protected by patents. (IANAL)

~~~
WalterGR

        make a clean room reimplementation
    

Which, as far as I understand it, could be someone posting a human-language
description of the algorithm here (based on the blog post), then ReactOS
writing code based on that description. (IANAL.)

~~~
ape4
Including the crazy constants.

------
tryp
A few years ago I implemented the storage system for a special-purpose
diagnostic camera. The specification defined (very long) filenames for the
saved images using a timestamp and some other data. I used a mostly
off-the-shelf microcontroller/NAND/USB mass storage reference implementation,
hooked it up via a side channel to the FPGA, and had everything working pretty nicely.
Until the test harness that just continuously commanded pictures to be taken
reached 105 iterations. After that, the camera timed out waiting on the
storage subsystem to store the image.

The problem turned out to be the code that found the 8.3 filename: it did the
longfi~1.bin, checked to see if that file existed and if so, incremented to
longfi~2.bin, then checked that... but never did the checksum trick described
here, just kept iterating. (Bear in mind this was a tiny 8-bit microcontroller
that didn't have the RAM to just read all the directory entries at once and
keep them around for comparison.) Finding the proper 8.3 filename this way took
longer than the timeout period after 104 collisions.

Of course, we only cared about the long filename and never saw the 8.3
filename, so my fix was simply to use an appropriate hash of the long filename
to ensure a good probability of uniqueness.
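
A rough Python sketch of why the probing approach degrades, under some loud assumptions: the directory is modelled as an in-memory set (the real device had to rescan the directory for each probe, so it was worse), the function names are hypothetical, and `zlib.crc32` is only a stand-in for "an appropriate hash". It is neither the hash used on the camera nor the Windows checksum from the article.

```python
import zlib


def naive_short_name(stem: str, ext: str, existing: set) -> tuple:
    """Probe STEM~1.EXT, STEM~2.EXT, ... until a free name turns up.

    Returns the chosen name and the number of existence checks it took.
    """
    n = 1
    while True:
        suffix = f"~{n}"
        # Trim the stem so that stem + suffix fits in 8 characters.
        candidate = f"{stem[:8 - len(suffix)].upper()}{suffix}.{ext[:3].upper()}"
        if candidate not in existing:
            return candidate, n
        n += 1


def total_probes(count: int) -> int:
    """Create `count` files sharing a long prefix and sum the probes needed."""
    existing = set()
    probes = 0
    for i in range(count):
        name, used = naive_short_name(f"longfilename_{i:06d}", "bin", existing)
        existing.add(name)
        probes += used
    return probes


def hashed_short_name(long_name: str, ext: str) -> str:
    """Derive the 8-character base from a hash of the long name (no probing)."""
    digest = zlib.crc32(long_name.encode("utf-8"))
    return f"{digest:08X}.{ext[:3].upper()}"


print(total_probes(105))
print(hashed_short_name("longfilename_000104", "bin"))
```

In this model the n-th file costs n probes, so 105 files cost 105 * 106 / 2 = 5565 probes in total, while the hashed variant does constant work per file at the price of a small chance of a name clash.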

~~~
TheLoneWolfling
Ah yes, the wonders of accidentally quadratic functions.

------
userbinator
The question that immediately comes to mind is what happens if the checksum
collides?

~~~
Quackmatic
If the checksums collide, then the number after the tilde is incremented
again (e.g. SOBC84~2.ASP). This time, it won't stop at ~4, so you can go up to
~10 and beyond. The file name will be shortened accordingly to fit the number
in (e.g. SOBC8~10.ASP). This was tested on Windows 7 x64.
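
The shortening rule can be sketched in a few lines of Python. This only models the two examples given above; the function name is hypothetical, and raising `ValueError` when the number leaves no room for the base is this sketch's own choice, not a claim about what Windows does in that case.

```python
def short_name_with_index(base: str, index: int, ext: str) -> str:
    """Fit base + "~index" into the 8-character name part of an 8.3 name."""
    suffix = f"~{index}"
    room = 8 - len(suffix)  # characters left over for the base
    if room < 1:
        # Sketch-only behaviour: the suffix alone already fills the name part.
        raise ValueError("no room left for any base characters")
    return f"{base[:room]}{suffix}.{ext}"


print(short_name_with_index("SOBC84", 2, "ASP"))
print(short_name_with_index("SOBC84", 10, "ASP"))
```

With the base SOBC84 this reproduces SOBC84~2.ASP and SOBC8~10.ASP. Every extra digit in the number costs one more character of the base, and a seven-digit number would use up all eight characters on its own.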

~~~
netsec_burn
And now the last question, how far can you take that logic? What happens at
1,000,000 (one million file names generated) when there are no more characters
left to remove from the left side?

Segmentation fault?

~~~
albinoloverats
Can FAT support that many files in a single directory?

~~~
netsec_burn
Actually, it doesn't look like it. It might need to be a race condition done
with a program (delete the old files to keep it going to a million). Might be
tricky.

~~~
Quackmatic
I've tried it. It works (this is on NTFS) and the end result is a little
anticlimactic: [https://usn.pw/blog/gen/2015/06/09/filenames/#follow-on-
coll...](https://usn.pw/blog/gen/2015/06/09/filenames/#follow-on-collisions)

