The main reason build tools exist is to perform transformations on a file set, and the vast majority of the useful things build tools do are side effects.
These "side-effectful" functions are basically transducers; the reduction is performed on the file set instead of a sequence. They are mostly stateful transducers, but this does not fly in the face of functional programming. The opposite, in fact! They facilitate separation of concerns in a way that an immutable but global configuration map cannot.
It's true that you need to interact with the filesystem eventually, but the longer that can be deferred, the more complexity can be avoided.
Ideally I'd want to construct a functional data structure that describes my build process, and at the end pass it to a side-effectful function to produce a change to the filesystem. Boot appears to be side-effectful from the get-go, but perhaps I'm mistaken about how it operates?
Sorry, I should have mentioned that while boot, like any JVM build tool, does begin and end with the class path, we have spent an enormous amount of time experimenting with ways to mitigate the effects. We ended up with a system that provides many of the benefits of immutability while still living in the real world where files actually exist.
Here are some of the things boot provides:
1. We have "pods", which are separate Clojure runtimes in isolated class loaders in which you can evaluate expressions. The actual building occurs in these things. They are lexically scoped and can have a different class path than the main Clojure runtime where your build pipeline runs.
2. Files emitted during the course of the build are created in temp dirs managed by boot. There are a few different kinds of these temp dirs, one of which is lexically scoped. We also have temp dirs that are effectively immutable from a given task's point of view (we use a copy-on-write scheme to achieve this).
3. We make liberal use of hard links and directory syncing to emulate immutability wherever we can. Boot provides a kind of structural sharing with these hard links that really makes the pain of dealing with files go away.
4. We put a great deal of thought into how artifacts flow through the build pipeline, and how tasks that don't know anything about each other can cooperate to work on these files.
This is the most interesting part of boot for me, and I'll be making a complete writeup about it soon.
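To make the pod idea concrete, here's a hedged sketch; the names follow boot's boot.pod namespace, but treat the exact API and the dependency coordinates as illustrative:

    (require '[boot.pod :as pod])

    ;; a pod with its own class path, isolated from the main runtime
    (def p (pod/make-pod {:dependencies '[[hiccup "1.0.5"]]}))

    ;; evaluated inside the pod; hiccup is invisible outside of it
    (pod/with-eval-in p
      (require '[hiccup.core :as h])
      (h/html [:p "hello from the pod"]))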
It sounds like Boot has had a lot of thought put into it, and I certainly welcome the idea of effectively immutable directories.
But it also looks like Boot dives head first into I/O, when I'd prefer a build tool that is a little more circumspect about complexity. While I welcome competition to Leiningen, and I'll certainly keep an eye on Boot, my initial impression is that it's heading in the opposite direction to where I'd want a build tool to go.
I'm of the mind that build tools aren't going to get better unless we forcibly insert instrumentation between our build tasks and the stateful resources upon which they depend. To that end, I'd like to hear more about both your Clojure runtime isolation and filesystem isolation mechanisms.
All JVM build tools are side-effectful from the get-go, and they really have to be. Consider the :dependencies and :source-paths keys in a Leiningen project.clj. The purpose of these is to manipulate the mutable class path. To have a JVM build tool that doesn't revolve around the class path would require a complete reinvention of the JVM ecosystem and all of its existing tooling (in our demo, for example, we use the Google Closure compiler, which mutates all kinds of things and would have to go), and that, I'm sure, is never going to happen.
You need to eventually be side-effectful, but that doesn't mean you need to start side-effectful. The :dependencies in a Leiningen project map are just a data structure until they're passed to eval-in-project, which happens at the end of a chain of functional operations.
One of the core ideas of Clojure is that we should favour simple solutions over complex ones. Side-effectful functions are some of the most complex tools we have, and while they are eventually necessary, it would be nice to have the majority of the codebase work with simple data structures, pushing the complexity of I/O out to the edges of the application.
Side-effects aren't necessarily bad, it's composability that is good. Purity is a composable property, but there are plenty of composable side-effects.
For example, let's ignore all other effects besides file IO (especially ignoring network IO). Further, let's assume that our only IO operations are `(read path) => data` and `(write path data) => nil`. Let's also assume that both of these operations are atomic (i.e. you can't observe a half-written file). If a build task attempts to read a file that doesn't exist, that task pauses itself. When a file is written, any tasks waiting on that file are resumed. If you re-write a file, it has to exactly match the already-written file, or the build fails. To kick off a build, you wait on one or more files and then start one or more tasks.
Viewed this way, the filesystem is a monotonic logic variable. Yes, the programming model is effectful, but there is a composable property: build repeatability. Just as you can compose arbitrary pure functions and get a pure function out, you can take any two arbitrary graphs of these constrained IO build tasks, compose them together, and the resulting larger graph will also be a repeatable build.
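To make the model concrete, here's a minimal sketch of those semantics in Clojure, using promises as stand-ins for files; all of the names are hypothetical:

    (def fs (atom {}))   ; path -> promise of file contents

    (defn- cell [path]
      (get (swap! fs update path #(or % (promise))) path))

    (defn read-file
      "Pauses the calling task until some task writes `path`."
      [path]
      @(cell path))

    (defn write-file
      "Write-once: a second write must exactly match the first, or the build fails."
      [path data]
      (let [p (cell path)]
        (when (and (not (deliver p data))  ; deliver returns nil if already delivered
                   (not= @p data))
          (throw (ex-info "conflicting write" {:path path})))))

Tasks block on the deref in read-file and are resumed when deliver runs, which is exactly the pause/resume behaviour described above.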
There's certainly a lot you can do to constrain I/O, but removing side-effects when possible will always be better than merely restricting them.
I do like the idea of immutable files and repeatable builds, but I don't think this negates the benefits of maximising the time you spend working with data.
For instance, there might be a task that automatically cleans up some files, but you want to keep those files around. If the tasks produce data structures that describe the build process, it's much easier to intercept and prevent or reorder the cleanup process. Functions, especially ones that are Turing-complete, are notoriously opaque.
> removing side-effects when possible will always be better than merely restricting them
I disagree. Effects are a very natural mental model for a great deal of problems and constraining yourself to purity is both impractical and quickly experiences diminishing returns.
Furthermore, if you can intercept effects, you can impose purity upon them. For an extreme example, consider application virtualization and containers such as Docker. By intercepting the system call table, you can create a "pure" filesystem from the view outside the container. At the other extreme, take a look at "extensible effects" and the Eff language, which lets you stub any subset of the effects available down to the individual expression!
> If the tasks produce data structures that describe the build process, it's much easier to intercept and prevent or reorder the cleanup process.
If you intercept all file IO, you can recover the same data. The only difference is whether or not you know that data upfront.
> Functions, especially ones that are Turing-complete, are notoriously opaque.
This is true! However, there are a great many build processes that do not know what they depend on or what they will produce until they do some Turing-equivalent work. For example, scanning a C header to find #include statements.
Rather than try to shoehorn all data into a declarative model, we need both 1) fully declarative descriptions and 2) the ability to recover a declaration from the trace of an imperative process.
An example of this trick, employed manually, is the notorious .d Makefiles. The C compiler finds all the dependencies and produces a submake file with the .d extension, then Make restarts recursively, using the new .d file as part of the dependency graph. However, it's a very unnatural way to think about the problem, and it leads to complex multi-pass build processes that are necessarily slower. Instead, the dependency graph could be produced as a side effect of simply doing the compilation, and that graph could be used as part of a higher-level declarative framework.
> Effects are a very natural mental model for a great deal of problems and constraining yourself to purity is both impractical and quickly experiences diminishing returns.
I can't personally recall a problem where purity was feasible but impractical, though I can think of a few examples of the opposite.
> If you intercept all file IO, you can recover the same data.
Yes, if you record all I/O, you could restore files that have been deleted by a previous task.
However, it seems rather more elegant, and more efficient, to prevent the files from being deleted in the first place.
> Rather than try to shoehorn all data into a declarative model
That isn't what I'm trying to say. There will inevitably be some cases where you need side-effectful I/O.
It's more that I'd rather see a solution start simple and pull in complexity as necessary, rather than start complex and attempt to work back to simplicity with constraints.
The problem with the read-everything-into-memory approach is that this is not how the JVM ecosystem works. Things written for the JVM use the classpath primarily, and things in memory secondarily if at all.
We can't control the fact that the CLJS compiler, for example, looks for source files on the classpath instead of in some FileSet object proxy. If we admit the use of tools written by the Java community at large, we suffer by adding another leaky half-abstraction to the mix.
We actually did some experiments with FUSE filesystems, but the performance is just not there yet. When FUSE performance becomes comparable to Java NIO it may become a viable option, and it would solve all of these problems. You could then have a "membrane" approach, where the JVM is only manipulating a filesystem proxy, and you have complete control over when and how to reify that and write to the filesystem.
Reading everything into memory wasn't supposed to be a complete solution. Not everything can be that simple. However, it seems to me to be better to start from a simple base and add complexity as necessary than to start from a complex base and try to achieve simplicity.
But let's run with the idea of loading everything into some immutable in-memory data structure, just to see where it goes. So long as we write everything in Clojure we're fine, but the moment we start hitting things adapted for the JVM, such as the CLJS compiler, we run into problems as you point out.
However, it's not too hard to conceive of possible solutions. Let's start with a simple but naive way around it. We'll take the files in memory, write them to a temporary directory, and then run the CLJS compiler with a classpath pointing to that directory. When the compiler is done, we take the result and load it into memory again.
Again, this is a solution that aims for simplicity rather than performance, but optimisations immediately suggest themselves. If the files already exist on disk, we symlink them or point the classpath directly at them. If we don't need the CLJS output file's contents, we can defer loading it into memory.
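Sketched out, the naive round trip might look like this; `compile!` is a hypothetical stand-in for invoking the CLJS compiler against real directories:

    (require '[clojure.java.io :as io])

    (defn tmp-dir! []
      (doto (io/file (System/getProperty "java.io.tmpdir") (str (gensym "build-")))
        .mkdirs))

    (defn spill! [dir files]            ; files: {relative-path -> contents}
      (doseq [[path contents] files]
        (let [f (io/file dir path)]
          (io/make-parents f)
          (spit f contents))))

    (defn slurp-dir [dir]               ; back to {relative-path -> contents}
      (let [root (.toURI dir)]
        (into {} (for [f (file-seq dir) :when (.isFile f)]
                   [(str (.relativize root (.toURI f))) (slurp f)]))))

    (defn compile-via-disk [files compile!]
      (let [src (tmp-dir!)
            out (tmp-dir!)]
        (spill! src files)
        (compile! src out)              ; e.g. point the compiler's classpath at src
        (slurp-dir out)))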
Haha, yes! Now we're cookin'! The "simple, but naive way" you describe above is pretty much the way boot does things. I'd say you could look at the boot cljs task to see this, but setting up the environment for the CLJS compiler is pretty tricky, so the code there isn't as clear and elegant as I'd like.
In boot, tasks don't fish around in the filesystem to find things, neither for input nor for output. Tasks obtain the list of files they can access via functions: boot.core/src-files, boot.core/tgt-files, et al. These functions return immutable sets of java.io.File objects. However, these Files are usually temp files managed by boot.
Boot does things like symlinking (actually we use hard links to get structural sharing, and the files are owned by boot so we don't have cross-filesystem issues to worry about), and we shuffle classpath directories and pods around.
So stay tuned for the write-up of the filesystem stuff, I think it might be right up your alley!
It sounds like there's a lot in Boot I'd like, particularly in how it deals with the filesystem. I'm still not convinced about the design, but it's clear I don't know enough about it to make a decision on it.
If nothing else, I'm sure there will be parts in it I'll want to steal ;)
I realised I might not have been very clear in my previous comment. Let me see if I can improve it with an example.
Let's forget about all other considerations and instead consider the simplest possible build system we can conceive. This build system should take a directory structure of source files, and produce a directory structure of output files.
If our sole consideration is simplicity, we might construct a build system like:
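Perhaps something along these lines; this is a sketch, and `transform` stands in for a hypothetical pure function from a file map to a file map:

    (require '[clojure.java.io :as io])

    (defn build [transform]
      (let [output (->> (file-seq (io/file "."))
                        (filter #(.isFile %))
                        (map (juxt str slurp))
                        (into {})        ; read everything into memory
                        transform)]      ; pure: {path -> contents} -> {path -> contents}
        (doseq [[path contents] output]  ; the single side-effectful step, at the end
          (io/make-parents path)
          (spit path contents))))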
So we take every file in the current working directory, read everything into memory, perform some functional transformation that produces a data structure of output files, then write that to disk. This minimises I/O, and gives us a functional data structure to play around with.
It's a naive approach, and one made without regard for memory or efficiency, but given that the amount of memory on a modern machine is far larger than the source directory is likely to be, it actually seems feasible.
However, we can also consider optimisations that don't alter the behaviour. For instance, we could only read in files when their contents are accessed. In order to protect against changes, we could check the modification date, and abort if it changes. It's a compromise, but a small one.
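As a sketch of that compromise (the names here are hypothetical), contents are read on first access, and a changed modification time aborts the build:

    (defn lazy-contents [path]
      (let [f     (clojure.java.io/file path)
            mtime (.lastModified f)]
        (delay
          (when (not= mtime (.lastModified f))
            (throw (ex-info "file changed during build" {:path path})))
          (slurp f))))

    ;; usage: @(lazy-contents "src/core.clj") reads the file at most once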
We might also conceive of a system where the contents of the file are memory mapped, or held in some temporary file, or any number of clever ways to avoid keeping the file in memory while not breaking the integrity of the data structure.
This is just a toy example, and lacking in many areas like network I/O, but it's easier to start simple and add complexity when necessary, than it is to start from an assumption of complexity and try to work backward to simplicity. This is why I think it's incorrect to start with side-effectful functions, because that means starting from complexity.
I think that the immutable data structure of lein is largely superficial. Once the final project map is built, all bets are off for the effectful behavior of downstream tasks/plugins. The only clear benefit from it is that you can capture a "complete" configuration with `lein pprint`. However, it does introduce significant cost in terms of the declarative configuration whack-a-mole, where you never quite know which magic ^:replace or :special-key to include to get what you want.
If you're not going to attempt to fix this superficiality by pushing immutable data deeper into the system, then it makes perfect sense to discard it completely.
Also, in boot we don't want to obtain a complete configuration before tasks run, because tasks can participate in the process. In boot a task can add dependencies or call other tasks, etc. This is why we need only one abstraction (tasks).
Just like in a Clojure program you don't have a complete configuration of values in variables before the program runs, because it's the program that creates those values. Boot figures so much out on its own that `boot pprint` isn't even a thing you would want.
Consider the hoops that needed to be jumped through to get the Maven wagons system implemented in Leiningen. In boot it's a non-issue: the environment is dynamic, so you can install wagon deps and then, in the next expression, install the deps and repositories that depend on that wagon. We didn't need to make any changes to boot to accommodate them.
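A hedged sketch of what that looks like, assuming an API along the lines of boot's set-env!; the wagon coordinates and repository URL are illustrative only:

    ;; put the wagon on the class path...
    (set-env! :dependencies #(conj % '[s3-wagon-private "1.1.2"]))

    ;; ...and in the next expression, use a repository that needs it
    (set-env! :repositories #(conj % ["private" {:url "s3p://my-bucket/releases/"}]))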
I still don't understand what the "task abstraction" is or what it provides. It seems to me that it's simply a Clojure function with a corresponding command line interface. Is that fair? Does it do something else too?
If it's just a function, I don't know why command-line support is valuable. For interactive use, the Clojure REPL is just fine. For automated use, you only need a single shell utility like perl or awk to evaluate an expression or run a script.
Yes, in the post I didn't get too technical with the treatment of tasks; I'll elaborate a little here. First, the command line thing.
You're right that the command line isn't strictly required to use boot, you can do everything at the REPL or perl/awk etc., as you pointed out. But for me it's really useful just ergonomically to be able to use command line arguments to configure ad-hoc builds because they can be very concise. Just like I probably wouldn't be super excited to use a Clojure shell instead of Bash, because Lisp is usually more verbose for the kinds of things I do on the command line. Consider:
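Something like the following; the task names and flags here are illustrative rather than exact:

    ;; on the command line:
    ;;
    ;;   boot watch cljs -O advanced
    ;;
    ;; versus the REPL equivalent:
    (boot (watch) (cljs :optimizations :advanced))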
The command line version is just nicer for that. When it's time to automate, you'd put that in your build.boot and make it a new task.
This brings us to tasks. We used to describe them as "middleware factories" but Rich has provided us with a way cooler name: stateful transducers. The build process can be imagined as a transducer stack applied to a file set instead of to an async channel or sequential thing. The principal value the task abstraction provides is its process-building power.
A typical task definition looks like this:
    (deftask foo
      "This task does foo."
      [...]                      ; kwargs/cli-opts
      (let [state ...]           ; local state
        (fn [continue]           ; middleware
          (fn [event]            ; handler
            ...                  ; build something, do work
            (continue event))))) ; call continuation
Please ignore "event", it's there for historical reasons. But like transducers, we have powerful ways to build processes from tasks that don't need to know anything about each other now. For example:
A key property of transducers is that they can also perform process control-flow duties. The boot `watch` task, for instance, is a totally general-purpose way to do incremental-anything in boot. The `cljs` task doesn't have a file watcher in it, and none of the other tasks do. They don't need it.
Another example is the `cljs-repl` task, which emits ClojureScript code when you start the CLJS REPL. This requires recompiling the JS file and reloading the client. This all happens automatically because the cljs-repl task can call its continuation whenever it likes, so it does that when you start the REPL. This means that your webapp code doesn't contain any REPL connecting code, so you don't have to think about removing it for production builds etc. The REPL connecting code is in there when you use the task, and not when you don't. Very clean.
Another interesting property of tasks is that they accept only keyword arguments. They do not take positional parameters. This means that partial application of tasks is idempotent, and that last-setting wins. For instance, given a function f that takes no positional parameters, we have:
(-> f (partial :foo "bar")) ==
(-> f (partial :foo "bar") (partial :foo "bar"))
and
(-> f (partial :foo "bar")) ==
(-> f (partial :foo "baz") (partial :foo "bar")).
This is pretty interesting because it gives us a nice way to manage global preferences. We have a macro called task-options! which can be used to globally apply options to tasks:
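For example, something like this; the map-style syntax and the extra option are reconstructions, but :bar lines up with the override example below:

    (task-options! foo {:bar  "baz"
                        :quux true})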
This macro actually does some currying and alter-var-root, replacing the value of the task var (deftask defines a var, of course) with a curried version. A cool thing about this is that the last-setting-wins property means that you can override these settings on the command line or in the REPL:
(boot (foo :bar "not-baz") (omg))
which would override the :bar option, but not the others.
This is probably long enough, hahaha! I'll hand the mic back to you now :)
> it's really useful just ergonomically to be able to use command line arguments to configure ad-hoc builds
It's largely also what contributes to "works for me" build environments... It's better to have a "just one way to do it" interface and discourage excessive tinkering with parameters. The more parameters there are, the more likely your dev environment is to be unstable across individual checkouts or developers. I know it's idealistic, but I think we should strive for zero-arg builds, which oddly means not making it easier to configure them.
I'll have to think on all the other stuff you wrote, since it's not totally clear to me yet. I may ping you again after I noodle a bit.
Your point about repeatability is valid. As a policy matter we would never advocate building a project without codifying the process as zero-arg tasks in the build.boot file. You can see this in our own projects (e.g. https://github.com/tailrecursion/boot-useful/blob/master/bui...). This project has one way to build the project jar file:
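The shape is roughly this (a hypothetical sketch, not the actual file contents):

    (deftask build
      "Build the project jar, with no tunable parameters."
      []
      (comp (pom) (jar) (install)))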
But we don't only use boot for repeatable builds! Boot is in a unique position in that it sits at the intersection of application and environment. That is to say, boot can be used to "bootstrap" the application. With boot we can create a kind of self-configuring application, where the entry point of the application is the build.boot file. This is a very clear win, for example, when running Clojure applications in Docker on Elastic Beanstalk, etc. (We'll write that up, too.)