Hacker News
Content based change detection with Make (andydote.co.uk)
54 points by pondidum on Sept 20, 2022 | hide | past | favorite | 40 comments



This computes a single hash based on the contents of all files. I suppose it's appropriate for languages/build systems that do not have file-level compilation. Otherwise, it would be more efficient to keep hashes per file and re-build only those that have changed.
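To illustrate the per-file alternative, here's a minimal sketch (the directory layout, stamp names, and `check` helper are all made up for illustration; this is not the article's code). Each source gets its own hash stamp, so only files whose contents changed are reported for rebuild:

```shell
#!/bin/sh
# Hypothetical sketch: one hash stamp per source file, so only changed
# files trigger a rebuild, instead of a single hash over all inputs.
set -eu
dir=$(mktemp -d)
mkdir -p "$dir/src" "$dir/.hashes"
echo 'int main(void){return 0;}' > "$dir/src/a.c"
echo 'static int x;'             > "$dir/src/b.c"

check() {  # print "rebuild <file>" for every file whose hash changed
  for f in "$dir"/src/*.c; do
    stamp="$dir/.hashes/$(basename "$f").sha256"
    new=$(sha256sum < "$f")
    if [ ! -f "$stamp" ] || [ "$new" != "$(cat "$stamp")" ]; then
      echo "rebuild $(basename "$f")"
      printf '%s\n' "$new" > "$stamp"
    fi
  done
}

first=$(check)              # both files are new, so both rebuild
echo 'int y;' >> "$dir/src/b.c"
second=$(check)             # only b.c changed
printf '%s\n%s\n' "$first" "$second"
rm -rf "$dir"
```

On the second run only b.c is reported, because a.c's stored hash still matches.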

A criticism of this implementation: the "state" directory is never cleaned; new states are always added. Therefore, if you go back to a state (hash) that you had already built in the past, you will not be able to build it again.

And there is no need to create separate shell scripts, when you can have all the relevant code inside your Makefile. Presumably those are not going to be called independently anyway.

As I've written on previous discussions:

Make by default uses the file change timestamp to trigger actions. But this is definitely not the only way, and you can code your Makefile so that rebuilds happen when a file's checksum changes. IIRC, the GNU Make Book has the code ready for you to study... Or, you might get more clever and say "when only a comment is changed, I don't want to rebuild"; file checksums are not the correct solution for this, so you can code another trigger.
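For reference, the checksum-trigger technique reads roughly like this. This is a sketch in the spirit of that book's example, not a verbatim copy; the rule and file names are illustrative:

```makefile
# Targets depend on a checksum stamp rather than the source file itself.
# The stamp's recipe runs on every invocation (via FORCE) but only
# rewrites the stamp when the hash actually changed, so the stamp's
# mtime - and thus the downstream rebuild - only moves on real edits.
%.sha256: % FORCE
	@sha256sum $< | cmp -s - $@ || sha256sum $< > $@

FORCE:

app.o: app.c.sha256
	$(CC) $(CFLAGS) -c -o $@ app.c
```

Touching app.c without changing its contents reruns the stamp recipe but leaves the stamp's mtime alone, so app.o is not recompiled.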


> Make by default uses the file change timestamp to trigger actions. But this is definitely not the only way

Can you point to some documentation for this? I haven't been able to find anything.


I wrote "the GNU Make Book has the code ready for you to study."

I'm currently traveling, without access to my books, but a quick look at the book's table of contents (https://nostarch.com/download/GNU_Make_dTOC.pdf) shows that p. 82 has the code for "Rebuilding When a File's Checksum Changes".

The GNU Make Book by John Graham-Cumming, No Starch Press, April 2015, 256 pp., ISBN-13: 978-1-59327-649-2


ccache is one approach. It's orthogonal in a way, but basically solving the same problem from a different angle.


Correct, the state directory isn't being cleared. As this is mostly aimed at making ephemeral build agents faster, this shouldn't matter much in practice.

Having said that, there is definitely some work to do on keeping the remote storage somewhat clean.


I did this long ago as part of my "Mr. Make" series. It's here: https://www.cmcrossroads.com/article/rebuilding-when-files-c...

And (self promotion) it is in my book: https://nostarch.com/gnumake (pg 82)


I paid full price for this book on No Starch, but I recently purchased it again as part of the "Linux by No Starch"[1] bundle that's going on now. I think that bundle would be appreciated by a lot of the crowd here. It's in the $10 and up tier.

[1]: https://www.humblebundle.com/books/linux-no-starch-press-boo...


Thanks!


Thanks for those links, I will check them out tomorrow!


Where can I purchase this fine text in dead-tree media?


I don't know if they plan to reprint it or not. Sorry.


Bravo! I haven't read the code carefully yet to fully understand the details of the implementation, but this is brilliant hackery.

As someone who has recently developed a focused interest in build systems, I have noticed the large vacuum in the space for a content-based Make-like tool with a similarly low barrier to entry. HN user bobsomers said it well the other day:

"There is a much tighter, cleaner, simpler build system within Bazel struggling to get out. A judicious second take at it with a minimal focus, while taking some of the best ideas, could be wildly successful I think."

https://news.ycombinator.com/item?id=32831890

I think 'redo' is probably the closest contender these days.

apenwarr's redo implementation is probably the most popular, the most mature, and has been built with a lot of smart design choices:

https://redo.readthedocs.io/en/latest/

Pluggable caching, which AFAIK apenwarr/redo does not yet support, would be a great addition for remote, shared caching.


Thank you! Yes, I agree there is a simpler build system inside Bazel waiting to get out; I'll definitely be looking into redo.


This is clever, but true content-based change detection in make would genuinely fix a bunch of issues. I'm sort of surprised it hasn't been done already.


You might enjoy Tup[1] if you've not checked it out before.

[1]: https://gittup.org/tup/


I've seen it but:

1. I find the syntax pretty off-putting.

2. Make is found everywhere, so a drop-in replacement is a useful feature.


Probably because it would fundamentally change how make works (i.e. without a database).


Doesn’t mean you can’t maintain the database with Make :) Prototype:

  .FORCE:
  .PRECIOUS: .sha256/sum/%
  .sha256/tmp/%: % .sha256/tmp .FORCE
          sha256sum < $< > $@
  .sha256/sum/%: .sha256/tmp/% .sha256/sum
          cmp $< $@ >/dev/null 2>&1 || mv $< $@
  .sha256/tmp .sha256/sum:
          mkdir -p $@
  
  .SUFFIXES:
  %.o: .sha256/sum/%.c
          $(CC) $(CPPFLAGS) $(CFLAGS) -c -o $@ $*.c

I expect you could get rid of most of the non-POSIX features by using more complex recipes, probably exiling them into separate shell scripts. (The main limitation is that V7/POSIX/BSD-style suffix rules don’t let you specify a rule for producing ANYTHING.foo from ANYTHING, whereas V8/GNU-style pattern rules do.)


Make only works “without a database” in a pretty loose sense: it relies on the file system to store the (very unreliable) metadata it depends on.

It could use xattr to store its content tags.


Assuming the filesystem supports xattr is an even bigger minefield.


We use all kinds of tools that require a database that is effectively hidden from us. (e.g. git). I don't think this is a significant blocker or problem.


I might be misunderstanding, but I think this fails when you revert to a previous revision of a file, since its hash reverts to a state that was already recorded, i.e.

  make
  # compiled foo.ts
  vim foo.ts
  make
  # compiled foo.ts
  git stash
  make
  # nothing to do


Yes, you're right if you don't have remote storage enabled; otherwise it should fetch the missing assets back again.

The main use case for this is ephemeral build hosts sharing the cache, but I want to work all these issues out too, so thank you for the feedback.


Here are two other approaches that I’ve stumbled on for doing similar things:

- https://www.cmcrossroads.com/article/rebuilding-when-files-c...

- http://olipratt.co.uk/rebuilding-makefile-targets-only-when-...


> It turns out you can do all of this with sha356sum in a one-liner:

Wait what

  | xargs -0 sha256sum \
Oh, just a typo.
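For what it's worth, the corrected one-liner's shape is roughly this (a sketch, not the article's exact command; the demo directory and paths are made up): hash every file, then hash the sorted list of per-file hashes into a single state value.

```shell
#!/bin/sh
# Sketch: collapse a directory tree into one content hash (sha256sum,
# not "sha356sum"). Sorting the NUL-separated file list keeps the
# combined hash stable regardless of filesystem enumeration order.
set -eu
dir=$(mktemp -d)
echo one > "$dir/a.txt"
echo two > "$dir/b.txt"

hash_tree() {
  (cd "$dir" && find . -type f -print0 | sort -z | xargs -0 sha256sum) \
    | sha256sum | cut -d' ' -f1
}

state=$(hash_tree)      # stable across runs for identical contents
state2=$(hash_tree)
echo three >> "$dir/b.txt"
state3=$(hash_tree)     # any content change yields a new state
printf '%s\n' "$state"
rm -rf "$dir"
```

The `cd` into the tree makes the per-file hash lines use relative paths, so the combined hash doesn't depend on where the tree happens to live.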


Oops, I'll fix that. Thanks


This is amazing, I also learned a thing or two about what good shell programming looks like.


If you're interested in more shell stuff, read the Bash Manual. It's not very long and it's quite enjoyable: https://www.gnu.org/software/bash/manual/bash.html

Run all your scripts through https://www.shellcheck.net/ (you can install it locally too) and correct all the errors it finds; click the explanation pages to understand why. Over time, improve your style so you don't generate errors in the first place.

Here are some more I've found useful:

- https://tldp.org/LDP/Bash-Beginners-Guide/html/index.html

- https://tldp.org/LDP/abs/html/index.html


I would also highly recommend shellcheck (https://www.shellcheck.net/) for useful error messages and warnings!


How would that work in the C/C++ world? For example, a.cpp #include's <a.h>; a.h doesn't produce any compilation artifacts. We change a.h but not a.cpp; since a.cpp is unchanged, there's nothing to do in the build?


Your makefile should specify that the rule depends on both a.h and a.cpp, the same as any makefile should.


Am I correct to assume that if I forget that in the Makefile, the compilation would still work fine?

Also, do I need to include all of <a.h>'s includes, and of these includes' includes?


Yes and potentially yes.

Normally, when writing Makefiles by hand, people assume the host system doesn't change, since tracking it is painful and sacrifices portability. If you want a more reliable/portable/stable build, the only real option is to use a build system like meson/cmake/autotools/Bazel/... to generate the appropriate Makefile for the current machine (and once you start doing that, you might as well stop using make and use ninja instead).


Normally you ask your compiler to generate the dependency graph on first run, then include it from your Makefile (make already knows to rerun itself when the Makefile deps change).
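With GCC or Clang that's the usual auto-dependency pattern. A sketch (variable names are illustrative):

```makefile
# -MMD writes a .d file next to each .o listing every header it actually
# included; -MP adds phony targets for those headers so deleting one
# doesn't break the build.
SRCS := a.cpp
OBJS := $(SRCS:.cpp=.o)
DEPS := $(OBJS:.o=.d)

%.o: %.cpp
	$(CXX) $(CXXFLAGS) -MMD -MP -c -o $@ $<

# Include the generated dependency files if they exist; on the first
# build they don't, which is fine because every .o gets built anyway.
-include $(DEPS)
```

After the first build, editing a.h causes a.o to rebuild because a.d records the a.cpp -> a.h dependency for make.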


ccache https://ccache.dev/ is great for C/C++, but it would be excellent if it exposed a more generic interface for use in the way this article describes.


Maybe not as portable, but isn't inotify better for this?


Inotify (and various equivalent change notification mechanisms on other systems) is probably ideal, but needs a fallback for a cold start. If a cold start isn’t the common case (as it is in Make), using slower cryptographic hashing instead of less reliable mtimes might be a good tradeoff. But I don’t think you could marry this to Make without major surgery. (Tup, mentioned elsethread, is this + automatic dependencies based on syscall tracing.)

This is not what the article is about, however: it’s about building a build-input-addressable artifact cache that can be shared across machines, à la Bazel or Nix. (That Nix is a build system—and NixOS is a distro built on said build system, the same way that Gittup is built on Tup, the BSDs on Make, or Gentoo on its thing—seems to be a well-kept secret. The distinguishing idea that in that case “build artifact cache” = “binary repo” is brilliant, though.)

What I don’t get is what purpose Make serves in the context of the article: when the build process turns a whole bunch of files into one with no intermediate steps or separate compilation, it seems a plain shell script would do just as well as Make + auxiliary shell scripts. You could perhaps obtain some more structure (less strict source ordering?) with Make, but the article doesn’t seem to do that.


To address the purpose of make: this was written with a large multi-workspace TypeScript project in mind. In the real project there are many more make targets with dependencies on each other, and we don't want to waste time rebuilding the different workspaces if they haven't changed.

I wasn't sure whether including all that extra information in the post was worth it or not, so I hope this answers your question.


Any non-trivial project would quickly hit inotify's watch limit (fs.inotify.max_user_watches), as inotify is non-recursive, so you'd need a separate watch for every monitored directory.


Horrific! This is the sort of poorly thought out hack my colleagues would do.



