These "side-effectful" functions are basically transducers; the reduction is performed on the file set instead of a sequence. They are mostly stateful transducers, but this does not fly in the face of functional programming. The opposite, in fact! They facilitate separation of concerns in a way that an immutable but global configuration map cannot.
Ideally I'd want to construct a functional data structure that describes my build process, and at the end pass it to a side-effectful function to produce a change to the filesystem. Boot appears to be side-effectful from the get-go, but perhaps I'm mistaken about how it operates?
Here are some of the things boot provides:
1. We have "pods", which are separate Clojure runtimes in isolated class loaders in which you can evaluate expressions. The actual building occurs in these things. They are lexically scoped and can have a different classpath than the main Clojure runtime where your build pipeline runs.
2. Files emitted during the course of the build are created in temp dirs managed by boot. There are a few different kinds of these temp dirs, one of which is lexically scoped. We also have temp dirs that are effectively immutable from a given task's point of view (we use a copy-on-write scheme to achieve this).
3. We make liberal use of hard links and directory syncing to emulate immutability wherever we can. Boot provides a kind of structural sharing with these hard links that really makes the pain of dealing with files go away.
4. We put a great deal of thought into how artifacts flow through the build pipeline, and how tasks that don't know anything about each other can cooperate to work on these files.
This is the most interesting part of boot for me, and I'll be making a complete writeup about it soon.
But it also looks like Boot dives head first into I/O, when I'd prefer a build tool that is a little more circumspect about complexity. While I welcome competition to Leiningen, and I'll certainly keep an eye on Boot, my initial impression is that it's heading in the opposite direction to where I'd want a build tool to go.
One of the core ideas of Clojure is that we should favour simple solutions over complex ones. Side-effectful functions are some of the most complex tools we have, and while they are eventually necessary, it would be nice to have the majority of the code-base work with simple data structures, and push the complexity of I/O out to the edges of the application.
For example, let's ignore all other effects besides file IO (especially ignoring internet IO). Further, let's assume that our only IO operations are `(read path) => data` and `(write path data) => nil`. Let's also assume that both of these operations are atomic (i.e. you can't observe a half-written file). If a build task attempts to read a file that doesn't exist, that task pauses itself. When a file is written, any tasks waiting on it are resumed. If you re-write a file, it has to exactly match the already-written file, or the build fails. To kick off a build, you wait on one or more files and then start one or more tasks.
Viewed this way, the file-system is a monotonic logic variable. Yes, the programming model is effectful, but there is a composable property: build repeatability. Just as you can compose arbitrary pure functions and get a pure function out, you can take any two arbitrary graphs of these constrained IO build tasks, compose them together, and the resulting larger graph will also be a repeatable build.
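The constrained model above can be sketched in a few lines of Clojure, treating each path as a promise (a monotonic logic variable). All of these names are hypothetical, purely for illustration:

```clojure
;; Sketch of the constrained IO model: each path maps to a promise.
;; Reading blocks until the file is written; a second write must match
;; the first or the build fails.
(def fs (atom {})) ; path -> promise of contents

(defn- cell [path]
  (get (swap! fs update path #(or % (promise))) path))

(defn read-file [path]
  ;; pauses the calling task until the file exists
  @(cell path))

(defn write-file [path data]
  (let [p (cell path)]
    (if (realized? p)
      (when (not= @p data)
        (throw (ex-info "Conflicting rewrite" {:path path})))
      (deliver p data))))
```

Because writes are write-once, two task graphs built on these operations compose into a larger graph that is still a repeatable build.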
I do like the idea of immutable files and repeatable builds, but I don't think this negates the benefits of maximising the time you spend working with data.
For instance, there might be a task that automatically cleans up some files, but you want to keep those files around. If the tasks produce data structures that describe the build process, it's much easier to intercept and prevent or reorder the cleanup process. Functions, especially ones that are Turing-complete, are notoriously opaque.
I disagree. Effects are a very natural mental model for a great many problems, and constraining yourself to purity is often impractical and quickly runs into diminishing returns.
Furthermore, if you can intercept effects, you can impose purity upon them. For an extreme example, consider application virtualization and containers such as Docker. By intercepting the system call table, you can create a "pure" filesystem from the view outside the container. At the other extreme, take a look at "extensible effects" and the Eff language, which lets you stub any subset of the effects available down to the individual expression!
> If the tasks produce data structures that describe the build process, it's much easier to intercept and prevent or reorder the cleanup process.
If you intercept all file IO, you can recover the same data. The only difference is whether or not you know that data upfront.
> Functions, especially ones that are Turing-complete, are notoriously opaque.
This is true! However, there are a great many build processes that do not know what they depend on or what they will produce until they do some Turing-equivalent work. For example, scanning a C header to find #include statements.
Rather than try to shoehorn all data into a declarative model, we need both 1) fully declarative descriptions and 2) the ability to recover a declaration from the trace of an imperative process.
An example of this trick, employed manually, is the notorious .d Makefiles. The C compiler finds all the dependencies and produces a submake file with the .d extension, then make restarts recursively using the new .d file as part of the dependency graph. However, it's a very unnatural way to think about the problem and it leads to complex multi-pass build processes that are necessarily slower. Instead, the dependency graph could be produced as a side-effect of simply doing the compilation, and that graph could be used as part of a higher-level declarative framework.
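The "recover a declaration from a trace" idea can be sketched like this (the helper names are made up, not part of any real tool): wrap the IO primitives so that running an imperative task leaves behind the dependency data a declarative system would want up front.

```clojure
;; Record every read and write as a side effect of doing the work.
(def trace (atom []))

(defn traced-read [path]
  (swap! trace conj [:read path])
  (slurp path))

(defn traced-write [path data]
  (swap! trace conj [:write path])
  (spit path data))

;; After a task runs, @trace is exactly the edge list a declarative
;; framework could consume on the next build.
```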
I can't personally recall a problem where purity was feasible but impractical, though I can think of a few examples of the opposite.
> If you intercept all file IO, you can recover the same data.
Yes, if you record all I/O, you could restore files that have been deleted by a previous task.
However, it seems rather more elegant, and more efficient, to prevent the files from being deleted in the first place.
> Rather than try to shoehorn all data into a declarative model
That isn't what I'm trying to say. There will inevitably be some cases where you need side-effectful I/O.
It's more that I'd rather see a solution start simple and pull in complexity as necessary, rather than start complex and attempt to work back to simplicity with constraints.
We can't control the fact that the CLJS compiler, for example, is looking for source files on the classpath instead of in some FileSet object proxy. If we admit the use of tools written by the Java community at large, we suffer by adding another leaking half-abstraction to the mix.
We actually did some experiments with fuse filesystems but the performance is just not there yet. When fuse performance becomes comparable to Java NIO it may become a viable option, and would solve all of these problems. You could then have a "membrane" approach, where the JVM is only manipulating a filesystem proxy, and you have complete control over when and how to reify that and write to the filesystem.
But let's run with the idea of loading everything into some immutable in-memory data structure, just to see where it goes. So long as we write everything in Clojure we're fine, but the moment we start hitting things adapted for the JVM, such as the CLJS compiler, we run into problems as you point out.
However, it's not too hard to conceive of possible solutions. Let's start with a simple, but naive way around it. We'll take the files in memory, write them to a temporary directory, and then generate a CLJS compiler with a classpath pointing to that directory. When the compiler is done, we take the result and load it into memory again.
Again, this is a solution that aims for simplicity rather than performance, but optimisations immediately suggest themselves. If the files already exist on disk, we symlink them or point the classpath directly at them. If we don't need the CLJS output file's contents, we can defer loading it into memory.
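A naive version of that roundtrip might look like the following (all helper names here are invented for illustration; a real integration would hand the temp dir to the CLJS compiler between the two steps):

```clojure
(require '[clojure.java.io :as io])

;; Dump an in-memory file set ({relative-path contents}) to a temp dir
;; that an external JVM tool can treat as an ordinary classpath entry.
(defn dump-to-temp-dir [files]
  (let [dir (.toFile (java.nio.file.Files/createTempDirectory
                      "build"
                      (make-array java.nio.file.attribute.FileAttribute 0)))]
    (doseq [[path contents] files]
      (let [f (io/file dir path)]
        (io/make-parents f)
        (spit f contents)))
    dir))

;; Read the (possibly modified) directory back into an immutable map.
(defn slurp-dir [dir]
  (let [root (.toPath (io/file dir))]
    (into {}
          (for [f (file-seq (io/file dir)) :when (.isFile f)]
            [(str (.relativize root (.toPath f))) (slurp f)]))))
```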
In boot, tasks don't fish around in the filesystem to find things, neither for input nor for output. Tasks obtain the list of files they can access via functions: boot.core/src-files, boot.core/tgt-files, et al. These functions return immutable sets of java.io.File objects. However, these Files are usually temp files managed by boot.
Boot does things like symlinking (actually we use hard links to get structural sharing, and the files are owned by boot so we don't have cross-filesystem issues to worry about), and we shuffle around with classpath directories and pods.
So stay tuned for the write-up of the filesystem stuff, I think it might be right up your alley!
If nothing else, I'm sure there will be parts in it I'll want to steal ;)
I guess I wasn't very clear either. I didn't mean deleted files. I meant the metadata you'd get from a "declarative" build.
Let's forget about all other considerations and instead consider the simplest possible build system we can conceive. This build system should take a directory structure of source files, and produce a directory structure of output files.
If our sole consideration is simplicity, we might construct a build system like:
(defn -main [task & args]
  (-> (read-dir (cwd))       ; load the source tree into memory
      (run-task task args)   ; transform the in-memory file data
      (write-dir (cwd))))    ; write the resulting tree back to disk
It's a naive approach, and one made without regard for memory or efficiency, but given that the amount of memory on a modern machine is far larger than the source directory is likely to be, it actually seems feasible.
However, we can also consider optimisations that don't alter the behaviour. For instance, we could only read in files when their contents are accessed. In order to protect against changes, we could check the modification date, and abort if it changes. It's a compromise, but a small one.
We might also conceive of a system where the contents of the file are memory mapped, or held in some temporary file, or any number of clever ways to avoid keeping the file in memory while not breaking the integrity of the data structure.
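The lazy-loading optimisation with a modification-date guard might be sketched as follows (hypothetical names, just to make the compromise concrete):

```clojure
(require '[clojure.java.io :as io])

;; Snapshot the mtime now, defer reading until the contents are needed,
;; and abort if the file changed out from under the build.
(defn lazy-file [path]
  (let [mtime (.lastModified (io/file path))]
    {:path path
     :contents (delay
                 (if (= mtime (.lastModified (io/file path)))
                   (slurp path)
                   (throw (ex-info "File changed during build"
                                   {:path path}))))}))
```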
This is just a toy example, and lacking in many areas like network I/O, but it's easier to start simple and add complexity when necessary, than it is to start from an assumption of complexity and try to work backward to simplicity. This is why I think it's incorrect to start with side-effectful functions, because that means starting from complexity.
I'm glad that people are putting thought into the cljs build process though, to this day it is still a particularly un-fun part of clojurescript.
If you're not going to attempt to fix this superficiality by pushing immutable data deeper in to the system, then it makes perfect sense to discard it completely.
Also, in boot we don't want to obtain a complete configuration before tasks run, because tasks can participate in the process. In boot a task can add dependencies or call other tasks, etc. This is why we need only one abstraction (tasks).
Just like in a Clojure program you don't have a complete configuration of values in variables before the program runs, because it's the program that creates those values. Boot figures so much out on its own that `boot pprint` isn't even a thing you would want.
Consider the hoops that needed to be jumped through to get the Maven wagons system implemented in Leiningen. In boot it's a non-issue: the environment is dynamic, so you can install wagon deps and then, in the next expression, install the deps and repositories that depend on that wagon. We didn't need to make any changes to boot to accommodate them.
If it's just a function, I don't know why command-line support is valuable. For interactive use, the Clojure REPL is just fine. For automated use, you only need a single shell utility like perl or awk to evaluate an expression or run a script.
You're right that the command line isn't strictly required to use boot; you can do everything at the REPL or with perl/awk etc., as you pointed out. But for me it's really useful, just ergonomically, to be able to use command line arguments to configure ad-hoc builds, because they can be very concise. Just like I probably wouldn't be super excited to use a Clojure shell instead of Bash, because Lisp is usually more verbose for the kinds of things I do on the command line. Consider:
$ boot cljs -usO none
(boot (cljs :unified true :source-map true :optimizations :none))
This brings us to tasks. We used to describe them as "middleware factories" but Rich has provided us with a way cooler name: stateful transducers. The build process can be imagined as a transducer stack applied to a file set instead of to an async channel or sequential thing. The principal value the task abstraction provides is their process-building power.
A typical task definition looks like this:
(deftask foo
  "This task does foo."
  [...] ; kwargs/cli-opts
  (let [state ...] ; local state
    (fn [continue] ; middleware
      (fn [event] ; handler
        ... ; build something, do work
        (continue event))))) ; call continuation

Tasks compose into bigger tasks with ordinary function composition:

(deftask bar
  "This task does bar."
  []
  (comp (foo :bar "baz" :baf "quux")
        (omg :hello "world")))
Another example is the `cljs-repl` task, which emits ClojureScript code when you start the CLJS REPL. This requires recompiling the JS file and reloading the client. This all happens automatically because the cljs-repl task can call its continuation whenever it likes, so it does that when you start the REPL. This means that your webapp code doesn't contain any REPL connecting code, so you don't have to think about removing it for production builds etc. The REPL connecting code is in there when you use the task, and not when you don't. Very clean.
Another interesting property of tasks is that they accept only keyword arguments. They do not take positional parameters. This means that partial application of tasks is idempotent, and that last-setting wins. For instance, given a function f that takes no positional parameters, we have:
(-> f (partial :foo "bar"))
  == (-> f (partial :foo "bar") (partial :foo "bar"))

(-> f (partial :foo "bar"))
  == (-> f (partial :foo "baz") (partial :foo "bar"))
For example:

(task-options!
  foo [:bar "baz"]
  omg [:hello "world"])
(boot (foo :bar "not-baz") (omg))
This is probably long enough, hahaha! I'll hand the mic back to you now :)
It's largely also what contributes to "works for me" build environments... It's better to have a "just one way to do it" interface and discourage excessive tinkering with parameters. The more parameters, the more likely your dev environment is to be unstable across individual checkouts or developers. I know it's idealistic, but I think we should strive for zero-arg builds, which oddly means not making it easier to configure them.
I'll have to think on all the other stuff you wrote, since it's not totally clear to me yet. I may ping you again after I noodle a bit.
(deftask build
  "Build project jar file."
  []
  (comp (pom) (add-src) (jar) (install)))
But we don't only use boot for repeatable builds! Boot is in a unique position in that it sits at the intersection of application and environment. That is to say, boot can be used to "bootstrap" the application. With boot we can create a sort of self-configuring application, where the entry point of the application is the build.boot file. This is a very clear win, for example, when running Clojure applications in Docker on Elastic Beanstalk, etc. (We'll write that up, too.)
Yes. I would love to see a writeup on that.