Mtime comparison considered harmful (apenwarr.ca)
120 points by panic 10 months ago | 59 comments

In Ninja I sorta stumbled through some of the same issues described here. I eventually realized that the interesting question is "does this output file reflect the state of all the inputs" and not anything in particular about mtimes, and that "inputs" includes not only the contents of the input files, but also the executables and command lines used to produce the output.

If you squint, mtime/inode etc. behave like a weak content signature of the input. And once you have that perspective, you say "if mtime != mtime I had last time, rebuild", without caring about their relative values, and that sidesteps a lot of clock-skew-related issues. It does "the wrong thing" if someone intentionally pushes timestamps to a point in the past (e.g. when switching branches to an older branch) as an attempt to game such a system, but playing games with mtime is not the right approach for such a thing; totally hermetic builds are.

One nice trick is that you can even capture all the "inputs" with a single checksum that combines all the files/command lines/etc., and that easily transitions between truly looking at file content or just file metadata. The one downside is that when the build system decides to rebuild something, it's hard to tell the user why -- you end up just saying "some input changed somewhere".
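
A minimal sketch of the "single checksum over all inputs" idea, not Ninja's actual implementation. The function names (`input_signature`, `needs_rebuild`) and the choice of SHA-256 are assumptions made for illustration. The same code path covers both modes: metadata only (mtime/size/inode compared for equality, never ordered) or full file content.

```python
import hashlib
import os


def input_signature(paths, command, use_content=False):
    """Return one hex digest covering every input file plus the command line."""
    h = hashlib.sha256()
    # The command line is an input too: changing a flag must change the digest.
    h.update(repr(list(command)).encode())
    for path in sorted(paths):
        h.update(b"\0" + os.fsencode(path) + b"\0")
        if use_content:
            # Strong signature: hash of the file's bytes.
            with open(path, "rb") as f:
                h.update(hashlib.sha256(f.read()).digest())
        else:
            # Weak signature: metadata compared only for equality.
            st = os.stat(path)
            h.update(repr((st.st_mtime_ns, st.st_size, st.st_ino)).encode())
    return h.hexdigest()


def needs_rebuild(paths, command, recorded_signature, use_content=False):
    """Rebuild whenever the signature differs from the one recorded last time."""
    return input_signature(paths, command, use_content) != recorded_signature
```

Because only one digest is stored per output, this also shows the downside mentioned above: a mismatch says that something changed, not which input it was.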

Is there a description of Ninja's algorithm anywhere? I looked at the manual [1] and didn't quite see it.

Does Ninja use a database like sqlite? It seems like it has to if it does something better than Make's use of mtimes. (e.g. the command line, which Make doesn't consider.)

I looked at redo (linked in the article) and it uses sqlite to store the extra metadata.

[1] https://ninja-build.org/manual.html

No, sorry. And I also mixed what Ninja actually does with some random observations in that comment.

Ninja does use some database-like things, but they are just in a simple text/binary format. It's actually been long enough that I have forgotten the details.

https://ninja-build.org/manual.html#ref_log (contains a hash of the commands used) / https://ninja-build.org/manual.html#_deps (database-like thing with some mtimes, see https://github.com/ninja-build/ninja/blob/master/src/deps_lo... )

This reminds me, I should study ninja's binary format and maybe borrow it :)

The sqlite3 database used in redo was just something I threw together in the first few minutes. sqlite was always massive overkill for the problem space, but because it never caused any problems, it's been hard to justify working on it. I'd like to port redo from python to C, though, and then the relative size of depending on a whole database will matter a lot more.

Someone rewrote ninja from scratch in C at one point and it's shockingly tiny. (No tests = no extra abstractions to make testing possible.)

I found samurai and it is indeed tiny! ~3400 lines was less than I was expecting for *.[ch] !


Ninja isn't too big though. It looks like about 13K of non-test code, which is great for a mature and popular project. Punting build logic to a higher layer seems to have been a big win :)

The "popular misconceptions" section seems to have a couple of the author's own misconceptions.

* On precision, he notes "almost no filesystems provide that kind of precision" (nanoseconds), but I would honestly say the exact opposite statement. ext4, xfs, btrfs, ZFS are some of the very common file systems that support this. He cites that his ext4 system only has 10ms granularity, which is most certainly not the default, but likely a result of upgrading from ext2/ext3 to ext4. As an aside, NTFS only has a granularity of 100ns.

* It is unclear what he means by "If your system clock jumps from one time to another...". If this is talking about NTP, it's probably accurate. My first reading was "daylight saving" or time zone changes, in which case, everyone uses UTC internally and such changes don't affect the actual mtimes. (You might get strange cases where a file listing regards a file modified at 01:45 to be older than a file modified at 01:20, but if you display in UTC, you can see it's just DST nonsense)

The author links to a post which indicates that while ext4 with 256B inodes might technically store mtime with nanosecond granularity, the call that ext4 makes to get that time is limited by the scheduler clock-tick granularity, aka HZ, which is 100, 250 or 300 for most people; hence reported granularities of 10ms, 4ms, 3.33ms, etc.

You're looking at the 90th percentile, he's looking at the 99th percentile. Claiming things like "everyone uses UTC internally" is obviously wrong when many people just a few years ago were still setting up systems in localtime when dual booting with Windows. Upgrading to ext4 is also not a rare edge case.

> Upgrading to ext4 is also not a rare edge case.

Ext4 has been stable for over a decade. It's been a default filesystem on many distributions. It was the default on RHEL 6[1] which was first released over 8 years ago, and the default for the ext variants after that. It's been in use in Debian since 6.0/Squeeze or later[2], which was 2011. It's been in use in Ubuntu since 9.10[3], released in late 2009.

To be clear, your argument is that it's not a rare edge case to have a filesystem that was originally only in common use as the default variant 6-8 years ago or more for the vast majority of installations, which has persisted and since been upgraded?

1: https://access.redhat.com/documentation/en-us/red_hat_enterp...

2: https://wiki.debian.org/FileSystem

3: https://wiki.ubuntu.com/KarmicKoala/TechnicalOverview#ext4_b...

Upgrading to ext4 is not extremely rare, but the recommended procedure involves mkfs and copying files over anyway: https://ext4.wiki.kernel.org/index.php/UpgradeToExt4

In-place upgrades do have the potential to leave some non-default options on the final ext4 file system, such as 128B inodes instead of the 256B ones, which is where limitations like reduced timestamp granularity come in.

In any case, my system was installed from scratch quite recently using native ext4. As others have pointed out and as I linked in the article, it’s likely a kernel issue. I assume many people have the same thing.

Even if you're using an offset from CMOS time, it's still UTC internally and in the filesystem.

> when many people just a few years ago were still setting up systems in localtime when dual booting with Windows

I don't see what the user has to do with how time is internally kept on the system.

It's a Windows issue (I don't keep up with windows so perhaps this was changed in the past few years, but I suspect MS hasn't done so due to back compatibility issues).

Windows 7 was released in 2009. It's been more than a few years.

True, but even though you can set:

It isn't flawless in practice, unfortunately. Internet Time Update no longer works then.

Is it the default? If so, then that's kinda neat.

> It is unclear what he means by "If your system clock jumps from one time to another...". If this is talking about NTP, it's probably accurate.

I don't know what he's talking about either. Most popular NTP clients (ntpd, chrony) will try very hard to make sure this never happens by simply slowing down or speeding up time. You don't know what will break if you just gap time like that.

There isn't really a mechanism to slow down or speed up time on Linux (this has become clear with the recent time namespace discussions). You'd have to constantly re-set the time to whatever slowed-down clock you are trying to emulate -- which isn't hard, but it would result in lots of micro-jumps backwards or forwards, which would be more chaotic than just one macro-jump.

ntpd and chrony might do it (I'm not sure), but systemd's NTP implementation (which is widely used, even though it does have many other issues -- such as not implementing the spec properly IIRC) does just jump time when you enable NTP on a system where it was disabled. From memory, back when I used ntpd, it did the same thing but I could be mistaken.

Yes there is, and has been for a very long time, specifically to support ntpd.


Sorry, I guess you're right. I misunderstood the discussion in [1]: the reason clock frequency changes weren't included in the timens patchset is that they were deemed far too complicated (not that current kernel timekeeping cannot do it). Not to mention that most people would want to adjust the clock speed for indefinite periods of time, which isn't what adjtimex is meant for.

[1]: https://lore.kernel.org/lkml/20180919205037.9574-1-dima@aris...

>"There isn't really a mechanism to slow down or speed up time on Linux (this has become clear with the recent time namespace discussions)"

Interesting, might you have any links to these discussions?

There is an LWN article about it[1], and the discussions are all on LKML[2].

[1]: https://lwn.net/Articles/766089/

[2]: https://lore.kernel.org/lkml/20180919205037.9574-1-dima@aris...

ntpd doesn't attempt that if the gap between recorded and actual time is far off (e.g. if the CMOS battery dies, which is the most common case I've encountered). In those cases, you're expected to manually intervene (e.g. by running ntpdate), which does indeed gap time.

Not sure about chrony, since I haven't used it (or heard of it, admittedly).

Most systems will jump the clock once at startup. This is sometimes implemented as a jump during the ntp startup script, which may also be run by the admin at any time.

The whole article is about edge cases. It doesn’t really matter if they aren’t super common: the result is that mtimes do act weird sometimes, and if you build a system that depends on them, it will also act weird sometimes.

Probably it's because this python implementation uses st.st_mtime (https://github.com/apenwarr/redo/blob/master/state.py#L314) which is a double and therefore is not precise to enough digits for maximum granularity.

doubles have ~15 decimal digits of granularity.

    $ python
    Python 2.7.13 (default, Sep 26 2018, 18:42:22) 
    [GCC 6.3.0 20170516] on linux2
    >>> 4e9 + 0.000001

In any case, I did my testing for the article using the C program linked from the article. The first timestamp in its output gives you a lower bound on your system's granularity: https://apenwarr.ca/log/mmap_test.c

Well, this limits you to ~1 us granularity, but you are right that it's not the limiting factor.

Apparently, according to https://stackoverflow.com/questions/14392975/timestamp-accur..., the ext4 driver just uses the cached kernel clock value without the counter correction that gives you a ns-precise value.

One could perhaps have an LD_PRELOADED fsync (or whatever) that updates the mtime with clock_gettime() to store it in its full nanosecond precision glory but it's probably not worth the performance penalty. That wouldn't address the mmap issue of course...

15 decimal digits isn't enough to encode a file's st_mtime seconds and nanoseconds value.

I use Perl and found this to be a problem. Like Python, it uses a double for st_mtime, and the nanoseconds value is truncated, so it fails equality tests with nanoseconds recorded by other programs (e.g. in a cache).

It even fails equality tests against itself, when timestamp values are serialised to JSON or strings with (say) 6 or 9 digits of precision and back again. Timestamps serialised that way don't round trip reliably due to double truncation.
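
To make the truncation concrete, here is a small sketch in Python (the same reasoning applies to Perl). The helper names and the example timestamp are made up for illustration. It compares pushing a nanosecond count through a double against keeping it as an integer, which is what `st_mtime_ns` gives you instead of `st_mtime`.

```python
import json


def roundtrip_via_float(mtime_ns):
    """Push integer nanoseconds through float seconds, as st_mtime-style code does."""
    seconds = mtime_ns / 1e9          # what a double-based st_mtime holds
    return round(seconds * 1e9)       # best-effort recovery of the nanoseconds


def roundtrip_via_json(mtime_ns):
    """Serialise the integer nanosecond count directly; no float is involved."""
    return json.loads(json.dumps({"mtime_ns": mtime_ns}))["mtime_ns"]


# Hypothetical timestamp with all nine sub-second digits in use.
example_ns = 1_546_300_800_123_456_789
print(roundtrip_via_float(example_ns) == example_ns)  # False: low digits are lost
print(roundtrip_via_json(example_ns) == example_ns)   # True: integers survive
```

A double near 1.5e9 seconds has a spacing of roughly 238 ns, so the error stays well under a microsecond, but that is still enough to break an equality test against a value recorded at full precision.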

> He cites that his ext4 system only has 10ms granularity, which is most certainly not the default, but likely a result of upgrading from ext2/ext3 to ext4.

What is the granularity of your file system? Mine appears to be 3.33ms.

Nanosecond granularity on all the ones I use, which includes ZFS, ext4, and tmpfs.

Check the timestamps which are actually set on the files.

I was surprised and disappointed to find Linux sets mtime to the nearest clock tick (250Hz on my laptop) on filesystems whose documentation says they provide nanosecond timestamps.

It's not obvious because the numbers actually stored still have 9 random looking digits. But the chosen mtime values actually go up only on clock ticks. If you're running those filesystems on Linux, try it yourself:

    (n=1; while [[ $n -le 10000 ]]; do > test$n; n=$((n+1)); done)
    ls -ltr --full-time

You should see the timestamp nanoseconds increment every few files, in batches. If they were truly nanosecond accurate mtimes, they would be different for every file.

That's why some of my programs on Linux now set the mtime explicitly with a call to clock_gettime() followed by futimens(), after writing the file. This makes sure the timestamps do change each time files are replaced, in case that happens more than once inside a 250Hz tick.
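
A rough Python equivalent of that workaround, as a sketch only. The function name is made up; `time.time_ns()` stands in for clock_gettime(CLOCK_REALTIME), and `os.utime(..., ns=...)` is Python's nanosecond-precision interface for setting file times. Whether the filesystem keeps all the digits is a separate question.

```python
import os
import time


def write_with_fresh_mtime(path, data):
    """Write data, then stamp the file with a full-resolution clock reading."""
    with open(path, "wb") as f:
        f.write(data)
    # Read the clock ourselves instead of accepting the kernel's cached tick value.
    now_ns = time.time_ns()
    # Set atime and mtime explicitly, in integer nanoseconds.
    os.utime(path, ns=(now_ns, now_ns))
    return now_ns
```

Two writes within the same clock tick then get distinct mtimes, as long as the underlying filesystem stores sub-tick precision.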

> the .git/index file, which uses mmap, is synced incorrectly by file sync tools relying on mtime

This part implies that the index file is written via mmap, but that's not true. It is fully rewritten to a new tempfile/lockfile, and then atomically renamed into place.

Git does not ever mmap with anything but PROT_READ, because not all supported platforms can do writes (in particular, the compat fallback just pread()s into a heap buffer).
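
For readers unfamiliar with the pattern, here is a generic sketch of "write a temp file, then rename it into place". This is not Git's code; the function name and temp-file prefix are assumptions for illustration.

```python
import os
import tempfile


def atomic_write(path, data):
    """Replace path with data so readers see either the old or the new file."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live on the same filesystem for the rename to be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the bytes to disk before renaming
        os.replace(tmp_path, path)  # atomic rename on POSIX filesystems
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

A side effect relevant to this thread: the replaced file is a new inode with a fresh mtime, so tools that compare mtime/inode for equality will notice the change.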

From this article I learned that build systems don't face a binary choice between mtime and checksum. Instead, a better solution is mtime plus a bunch of other things. The article explains the faults of mtime and checksum clearly.

This insight makes me want to try redo.

One thing I dislike about redo is that it probably does not work well on Windows. Has anybody ever tried?

Redo also makes me wonder: Is a build directory just a habit from using Make, or is it a flaw of redo to not support that well? By "build directory" I mean the concept where the build process generates files in an extra directory which can simply be deleted, so that nothing pollutes the source directory.

> One thing I dislike about redo is that it probably does not work well on Windows. Has anybody ever tried?

Yes, although only for a toy project. I'm sure it'd work fine with Cygwin or WSL, it might work with MSYS, but I've definitely had it working with busybox-w32[1].

> Is a build directory just a habit from using Make or is it a flaw of redo to not support that well?

You can use redo in a Make-like fashion by putting a single `default.do` in the root of your project that decides what to do by examining the filename it's been asked to build. That does give up some of the benefits of redo, however (since a single file builds everything, when you edit that file redo wants to rebuild everything).

Having a separate build directory that can be easily wiped is a good idea, but I'm a lot less worried about it (or things like 'make clean') now that I have 'git clean -dxf'.

[1]: https://frippery.org/busybox/

Daniel J. Bernstein, who originally designed redo, also came up with the slashpackage idea. In that system, the build directory starts out as a writable copy of the (potentially read-only) source directory, complete with all of the .do files in the case of systems built with redo.

* http://jdebp.eu./FGA/slashpackage.html

If you are going to run some sort of autoconfiguration tool, then having it generate the build directory (by copying or symlinking the .do files over) is a trivial extra step.

I think all of the author's observations are valid, but in real life I've never encountered any of them. I don't have builds triggered automatically from inotify or the like, and I'm not so fast with my fingers that it takes me less than a second between saving a file and kicking off a make in another terminal window.

And anytime I do an initial check out from any version control, "make clean" is always the first step.


hoping jdebp will chime in..

I have never before seen this idea to put CXXFLAGS into a file and treat it as a file dependency. That would also work with Make. Clever idea.

Nevertheless, a build system which does this implicitly is better, imho.

Perhaps unsurprisingly, djb seems to have pioneered this with his Makefiles, which generally produce, then depend on and run, a 'compile' script that contains the flags.
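
A sketch of the flags-as-a-file trick, under the assumption that the build tool notices changes to the flags file (by mtime or checksum). The helper name and file layout are hypothetical. The key detail is to rewrite the file only when its content would change, so unchanged flags never trigger a rebuild.

```python
import os


def update_flags_file(path, flags):
    """Rewrite the flags file only when its content would change.

    Returns True if the file was (re)written, False if it was left alone.
    """
    new_text = " ".join(flags) + "\n"
    try:
        with open(path, "r") as f:
            if f.read() == new_text:
                return False  # unchanged: leave the mtime alone
    except FileNotFoundError:
        pass  # first run: fall through and create the file
    with open(path, "w") as f:
        f.write(new_text)
    return True
```

Every object file would then list the flags file as a dependency next to its source file, which works with plain Make as well as redo.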

The article suggests writing explicit rules to check for changes in the toolchain.

These dependencies can be recorded automatically with LD_PRELOAD. LD_PRELOAD can redefine functions such as fopen and let you record what files are read when running e.g. cc.

This makes it feasible to record the entire relevant state of a system, files and environment variables, at build time.

An argument in favor of checksums is the use of build caches. Switching between branches on large codebases triggers a lot of rebuilds. With a build cache, that can be avoided. SCons is a build system that uses such a cache.
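
To illustrate why checksums make a build cache possible, here is a toy content-addressed cache. It is not how SCons implements its cache; the function names, the cache layout and the `run_build` callback are assumptions for the example.

```python
import hashlib
import os
import shutil


def cache_key(input_paths, command):
    """Digest of the command line plus the content of every input file."""
    h = hashlib.sha256()
    h.update("\0".join(command).encode())
    for p in sorted(input_paths):
        with open(p, "rb") as f:
            h.update(b"\0" + hashlib.sha256(f.read()).digest())
    return h.hexdigest()


def build_with_cache(cache_dir, input_paths, command, output_path, run_build):
    """Return True on a cache hit; otherwise call run_build() and store its output."""
    os.makedirs(cache_dir, exist_ok=True)
    cached = os.path.join(cache_dir, cache_key(input_paths, command))
    if os.path.exists(cached):
        shutil.copyfile(cached, output_path)  # reuse the earlier result
        return True
    run_build()  # expected to create output_path
    shutil.copyfile(output_path, cached)
    return False
```

Switching to another branch and back produces the same key as before, so the second switch is a cache hit even though every mtime has changed.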

Depending on LD_PRELOAD is extremely fragile and finicky.

Not only can a process sidestep libc entirely by calling the `open`(2) syscall, but there are often many ways of combining function calls to achieve the same outcome. This method will also fail completely on systems that have new, previously unknown functions that are not monitored by the LD_PRELOAD solution.

Worst of all, an LD_PRELOAD solution would not cover operations that are done on behalf of the target program by external programs via IPC (think system daemons and dbus), at least not without intercepting and interpreting all I/O that the target does.

In short, it doesn't scale.

> Random side note: on MacOS, the kernel does know all the filenames of a hardlink, because hardlinks are secretly implemented as fancy symlink-like data structures. You normally don't see any symptoms of this except that hardlinks are suspiciously slow on MacOS. But in exchange for the slowness, the kernel actually can look up all filenames of a hardlink if it wants. I think this has something to do with Aliases and finding .app files even if they move around, or something.

I don’t think this is true anymore with APFS.

The checksum issue could be addressed by having the compiler generate the checksums in a "sidecar" file and have the build system depend on those.

Obviously this faces the "not every toolchain will support this" problem, but you could have a switch to use checksums and continue to use the older approach by default.

You would have to call cksum instead of touch in your Makefile.

With a modules system (like C++20 is trying to embrace) you could in theory generate a vector of checksums (with subranges in the file) for each file and only recompile portions of the file that needed it. It's rather an accident of history that we use the granularity of a file at all.

Why not just use a complete checksum of the file?

The average project is what, a few MB? Less? Most of which is going to be cached after the first compile anyway.

Even on an enormous codebase, as long as you have an SSD or a nontrivial amount of RAM I can't see this being an issue.

You don't care if a file is newer - you care if it's different!

This is discussed near the end of the article. It works best when the file system itself stores a checksum in its metadata, so it does not have to be calculated for each file for each build. It's not appropriate when your build may include dependencies based on other side effects besides file content. For example, sometimes you depend on the timestamp of an empty file, or the success or failure of another step based on e.g. a log message, to trigger other actions.

Is it common to build over NFS? I can't immediately see a use case - collaborative editing or something? Even in that case, wouldn't it be easier to build on the box?

The other case seems valid in a sort of 'if your build intentionally makes use of mtime, you'll need to look at mtime'. It seems like an odd thing to do in the first place - I guess Makefile as deployment rather than for building?

A 470K LoC project I have here with >1000 files takes 0.04 seconds to do a full sha256sum traversal on my box from the cache. That's single-threaded.

If I drop caches, it takes approximately 1 second (from spinning rust, not SSD).
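
For anyone who wants to repeat that measurement in a form a build tool could use, here is a sketch of a full-tree SHA-256 traversal. The function names are hypothetical, and timings will depend heavily on the page cache and the disk.

```python
import hashlib
import os


def tree_checksums(root):
    """Map each file's path (relative to root) to the SHA-256 of its content."""
    sums = {}
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # deterministic traversal order
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            h = hashlib.sha256()
            with open(path, "rb") as f:
                # Read in 1 MiB chunks so large files don't need to fit in memory.
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    h.update(chunk)
            sums[os.path.relpath(path, root)] = h.hexdigest()
    return sums


def changed_files(old, new):
    """Paths that were added, removed, or whose content differs."""
    return sorted(p for p in old.keys() | new.keys() if old.get(p) != new.get(p))
```

Touching a file without changing its bytes leaves its checksum, and therefore the `changed_files` result, untouched; that is the "different, not newer" behaviour argued for above.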

I do a lot of nfs builds at work as part of development on a proprietary OS. Some things will only nicely build on-OS, which for dev purposes is generally running in a non-local VM. The relevant git repos are massive enough (I'll frequently be building a small part of the tree, but still have to clone the whole thing), and the VMs disposable enough, that dealing with slower builds via an nfs mount from my workstation and/or homedir server is faster than repeatedly cloning.

You can probably argue that this is a consequence of bad tooling rather than any strength of nfs builds, but it is an example of a non-trivial number of developers frequently building over nfs.

For the median project, you could just rebuild everything from scratch every time and avoid the problem.

It's the huge projects with millions of files and tens of gigabytes of source and assets that need these optimizations the most, and that's also where checksumming is the most painful.

It's not as unrealistic or monstrous as it sounds. It happens in monorepos when you include all of a project's thousand dependencies (down to things like openssl and libpng).

At work, we frequently build in a Linux VM (using VirtualBox through Vagrant) from a macOS host, and the default shared folder does not support symlinks. We use NFS as a workaround.

NFS is getting so rare that some systems aren't even organized to accommodate "mount -o ro /usr" anymore.

NFS is widely used in HPC to mount user home directories on compute nodes.

I use NFS a decent amount, just not for anything like this, because even on a link with e.g. 5ms latency you end up with issues all over the place.

It seems like solving a problem that could be fixed more easily by just rsyncing or cloning the codebase. Storage is cheap.

hm! i wonder if ZFS exposes the hash to the outside world..

Check out the source sizes for Firefox or Unreal Engine. A project that's so small is a project that isn't putting big demands on the build system anyway, so that exception proves the rule.

Those look like some good ideas, because currently I use just mtime-based checks (for programs with multiple files; many of my programs are only one file and so don't need to deal with stuff like that).

Ugh, these "* considered harmful" blog posts...

the only message I read is that someone wants to believe they're as smart as Dijkstra. /s
