Idiomatic Clojure without sacrificing performance (bsless.github.io)
150 points by ceronman 3 months ago | 79 comments



A datapoint from a self-funded entrepreneur, for whom an app written in Clojure and ClojureScript provides a living:

Before you start discussing whether a language is "slow", check if perhaps it is "fast enough".

Apart from trying not to write stupid code, I don't even optimize anymore, because coupled with ridiculously fast servers (bare-metal hardware from Hetzner in my case), it's a waste of time. Things run great, and I'd much rather invest in building complex app features that customers will pay for, and have flexible living code that I can easily rearchitect as needed.

The last time I had to optimize Clojure code was when I wrote a search engine that had to respond to incremental searches in a fairly large E-commerce product database. But that was also about 12 years ago, so computers were quite a bit slower, and JVMs were not as good as they are today.

I do tend to use transducers quite a bit, because they provide great composability and structure the code nicely, while providing a good performance boost as well (you avoid creating lots of intermediate data structures, which not only saves CPU, but also reduces cache, memory and GC usage) — but I don't worry about performance too much. It's just not a problem that appears on the radar.
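
For illustration, a toy sketch of the difference (not code from the app):

    ;; seq version: each step allocates an intermediate lazy sequence
    (->> (range 1000)
         (map inc)
         (filter even?)
         (reduce +))

    ;; transducer version: one pass, no intermediate collections
    (transduce (comp (map inc) (filter even?)) + (range 1000))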


> Apart from trying not to write stupid code, I don't even optimize anymore, because coupled with ridiculously fast servers (bare-metal hardware from Hetzner in my case), it's a waste of time.

Inquiring mind wants to know: how do you deploy your Clojure (Script) webapp on your bare server? Just an "uber war" that you drop there? (bare servers at OVH here btw but I know Hetzner is good too)


Yes, I used a single uberjar for the first year or so, but then it got a bit more involved, because of delivery optimizations (nginx, brotli, precompressed files, versioned immutable static content with infinite cacheability, etc). So these days it's an uberjar (AOT-compiled) plus a bunch of files, essentially.

The servers are managed using ansible, and I also have a terraform setup for quickly spinning up a development cluster "in the cloud". Ansible takes a list of servers either from a static file or from terraform.


To add to this, you can also compile an uberjar into a static executable with Graal VM. I use that to build CLI apps for example, and it works wonders.


Keep in mind that your performance for long-running processes will get worse compared to just running an uberjar, as you're missing lots of optimizations by going with GraalVM instead of HotSpot.


*Unless you're paying for the commercial version

Worth noting, GraalVM's JVM implementation has a different JIT compiler which seems to give better performance for highly dynamic and polymorphic languages like Clojure and Scala


Yes, but in most cases GraalVM would be a preferable choice (https://www.inner-product.com/posts/benchmarking-graalvm-nat...).


Haven't read the full article, but I'm not sure it's accurate: the benchmarking script (https://github.com/inner-product/http4s-native-image/blob/ma...) doesn't even use the server VM and is missing other basic optimizations you'd use if you were running an uberjar in production.


So what does your app do? And what have been the biggest drawbacks from your language choice?


It's a PLM/ERP type app for electronics — https://partsbox.com/

Drawbacks, hmm — I'm not sure, really, there is no silver bullet, and I can't see significant problems with the language itself that another language would magically solve. I'm pretty happy with Clojure and ClojureScript and some advantages are hard to replicate elsewhere (such as business logic code being written only once and then shared between Clojure and ClojureScript).
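
As a toy illustration of that sharing (made-up names, nothing from the actual app), a function like this lives in a .cljc file and gets compiled by both Clojure and ClojureScript:

    (ns myapp.pricing)

    ;; pure business logic, usable from the JVM backend and the browser alike
    (defn order-total [line-items]
      (reduce + 0 (map #(* (:qty %) (:unit-price %)) line-items)))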

One thing I noticed is I got spoiled by Rich Hickey solving problems for me, and I'd really like him to continue doing so. For example, transducers and core.async were great, but spec isn't quite finished yet. Also, reporting status and anomalies (as in, status of operations, computations, including errors/exceptions) and passing it through channels or foreign encodings is a problem I had to solve myself. I'm pretty sure Rich could come up with a better solution.

The biggest problems that I encounter everyday are not with the language itself, but with the ecosystem. Many libraries could use more maintenance. And the really big problems aren't language-related at all. For example, RethinkDB shutting down. I also wish I didn't have to write the full FoundationDB integration ("language binding") myself.


The biggest performance issue Clojure has, which isn't mentioned in the article and is fundamentally unsolvable, is that it misses the CPU cache - a lot.

The data structure that drives the immutable variables, the Hash-Array Mapped Trie, is efficient in terms of time- and space- complexity, but it's incredibly CPU cache-unfriendly because by nature it's fragmented - it's not contiguous in RAM. Any operation on a HAMT will be steeped in cache misses, which slows you down by a factor of hundreds to thousands. The longer the trie has been around, the more fragmented it is likely to become. Looping over a HAMT is slower than looping over an array by two or three orders of magnitude.

I don't think there is really a solution to that. It's inherent. Clojure will always be slow, because it's not cache friendly. The architecture of your computer means you physically can't speed it up without sacrificing immutability.


It is mentioned in the article, one of the last optimizations done there was switching to arrays. Also, Clojure's HAMT was designed with the CPU cache in mind and its performance characteristics don't degrade over time. Immutable data structures will be slower than arrays - that's true, but Clojure's standard library works on arrays just fine, as demonstrated in the article.


You're right - I missed that, the author does mention arrays having better access characteristics, although he doesn't really explain why HAMTs specifically are slow.

How does Clojure's HAMT avoid fragmenting over time?

> Clojure standard library works on arrays just fine, as demonstrated in the article.

Right, but then you don't have immutability - so you lose all the guarantees that you originally had with immutable-by-default.


> although he doesn't really explain why HAMTs specifically are slow

Hello, author here :)

HAMTs are certainly slower for "churning" operations, i.e., lots of updates, which is where Clojure exposes the transient API, giving you limited, localized mutability (some terms and conditions may apply).

Where iteration is concerned, the standard library implementation is pretty good. It relies on chunks of 64-element arrays which store the keys and values contiguously. Thus, APIs which expose direct iteration, Iterator and IReduce(Init), operate on these chunks one at a time. It isn't as fast as primitive arrays, but it's pretty fast.
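
A rough way to see that fast path (a sketch, assuming criterium is required as cr like elsewhere in this thread):

    (def v (vec (range 1000000)))

    ;; reduce goes through IReduce(Init) and walks the chunks directly
    (cr/quick-bench (reduce + v))

    ;; a hand-rolled first/next loop goes through seq objects instead
    (cr/quick-bench
     (loop [s (seq v) acc 0]
       (if s
         (recur (next s) (+ acc (long (first s))))
         acc)))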


> but then you don't have immutability - so you lose all the guarantees that you originally had with immutable-by-default

Not really, for example take the following solution with execution time mean : 1.678226 ms

    (defn smt-8''' [^ints times-arr]
      (loop [res (transient []) pointer-1 (int 0) pointer-2 (int 7)]
        (if (< pointer-2 (alength times-arr))
          (let [start-element (aget times-arr pointer-1)
                end-element (aget times-arr pointer-2)
                time-diff (- end-element start-element)]
            (recur
             (if (< time-diff 1000)
               (conj! res [(mapv #(aget times-arr (+ pointer-1 (int %))) (range 8))
                           time-diff])
               res)
             (inc pointer-1)
             (inc pointer-2)))
          (persistent! res))))
It only requires the input to be an array, but it will return an immutable persistent vector of vectors. So it is very easy to selectively go down to an array in the performance critical sections while being immutable and idiomatic in most places.

> he doesn't really explain why HAMTs specifically are slow

Take this solution using HAMT vectors with execution time mean : 23.567174 ms

    (defn smt-8' [times-vec]
      (loop [res (transient []) pointer-1 (int 0) pointer-2 (int 7)]
        (if-let [end-element (get times-vec pointer-2)]
          (let [end-element (int end-element)
                start-element (int (get times-vec pointer-1))
                time-diff (- end-element start-element)]
            (recur
             (if (< time-diff 1000)
               (conj! res [(subvec times-vec pointer-1 (inc pointer-2))
                           time-diff])
               res)
             (inc pointer-1)
             (inc pointer-2)))
          (persistent! res))))
Now it is 14 times slower than my prior one which iterates over an array, but it is still pretty fast, so that's OP's point: things are often sufficiently fast, and when they are not you can selectively optimize those parts, and easily too, see how similar my two solutions are to one another.

Edit: These assume you've `(set! *unchecked-math* true)` prior to compiling the functions.


I’m familiar with the implementation of HAMTs - if anyone wants to study one in C I recommend https://github.com/Jamesbarford/hash-array-mapped-trie or my polymorphic fork of it https://github.com/fromheten/hash-array-mapped-trie-poly.

Are there any other key/value data structures where insertion and retrieval are less than O(n) in complexity, but where the memory layout is better ordered for cache hits during searches? Maybe good old red-black trees?


I don't think there's an immediately available answer here. The wide branch factor in Clojure's implementation is very iteration-friendly, less so for updates.

Worth checking out are BTrees and Hitchhiker trees, but I think a definitive answer will be implementation dependent even in those cases, i.e. one might win out over the other for a specific branch factor or other tune-able parameters


I wonder if the wide branching factor of them gives you some cache friendliness.

However, not every use case can benefit from cache optimization and you can use other data structures. It’s not super useful to make generalizations about performance that way.



While I do agree, with pointer chasing down a persistent data structure being basically the worst case use of a CPU, the ease of threading Clojure programs means you can often claw a lot of that penalty back.


Threading doesn't compensate for that degree of slowdown, and itself has overhead. You'll get something back, but not much.


Again, generalisations about performance like that are just not very useful. There are use cases where you get significant benefits from using persistent data structures.

Aeron comes to mind as an example of a high-performance solution that uses them. But a more fundamental reason is that immutability can be part of your domain logic. Think video games like Braid, or business applications where you need to derive a lot of different views and go back and forth between them, or domains where data just is inherently immutable, such as accounting and so on.


I don't really agree. Cache friendliness is almost always a relevant factor as soon as performance becomes an issue. I get what you're saying but as I see it immutability gives you architectural efficiency and in some cases space efficiency, but rarely processing speed.

> Think video games like braid

Braid doesn't use immutable data structures, it uses a historical record (and immutability would be incompatible with some of the mechanics).[1] The author of Braid is actually quite famous for his dislike of functional languages and immutability. He doesn't even like garbage collection because of the overhead it introduces.

Interestingly, he was talking about data structures for codebase representation (in the context of a compiler) a while back, and someone mentioned HAMTs. I'm definitely curious if they would work well there.

[1] https://www.youtube.com/watch?v=8dinUbg2h70


You’re picking out an admittedly bad example in order to miss the point. Different data structures have different tradeoffs. Not every problem lends itself to the same type of optimization, so it’s not helpful to make these generalizations.

Continuing with the example regardless, change records and persistent data structures have different performance characteristics. The former is fast if you move incrementally between states; the latter enables arbitrary access, comparison, and efficient in-memory caching of multiple views.

It would be interesting to explore and measure the trade offs under certain use cases.


I understand your point. I'm saying: the subset of problems that benefit in a performance sense from immutability is very small. The vast majority of the time, cache misses slow down algorithms. That's a perfectly reasonable generalisation and I don't really understand why you think it's untrue.

Re: change records, I believe Braid uses a historical record of absolute states, not a series of diffs. The state at a particular time is recreated from a minimal representation (a couple of arrays). That's much more efficient than throwing multiple iterations of the state in a HAMT.


> The vast majority of the time, cache misses slow down algorithms. That's a perfectly reasonable generalisation and I don't really understand why you think it's untrue.

I don't say it is untrue. But not every use case lends itself to CPU cache optimization. Your access patterns might just happen to be arbitrary or fragmented from a layout perspective.

I would argue that this is a very common case, especially for IO-bound applications that operate on tiny pieces of data and have deeply nested dispatch rules that each query some section of memory you can't know in advance.

Or in other words: The clearer your computation pipeline can be, the more you can benefit from CPU cache optimization.

> Re: change records, I believe Braid uses a historical record of absolute states, not a series of diffs. The state at a particular time is recreated from a minimal representation (a couple of arrays). That's much more efficient than throwing multiple iterations of the state in a HAMT.

You caught my interest, I will have to study this. I can't really imagine how it works from your explanation (my bad, not yours). I assumed when you said "change records" it meant that it would be suboptimal to access an arbitrary state (linear as opposed to constant).


In my experience you can usually achieve near linear speed up. My machine can run 24 threads.


Fair enough on scaling. But 24 is still a lot less than two to three orders of magnitude.


Yes, it is. I'd probably ballpark Clojure at 100x slower than the kinds of low level C# I usually write (gamedev). But threading C#, especially low level imperative C#, is so awful I often don't bother unless it's very important or there's an embarrassingly parallel element (genetic algorithms and simulating sound waves on a 2D grid are two cases I've pulled the trigger where both were true).

This leaves Clojure as 1/4 the overall speed, which seems about right. However that's based on a hugely unscientific gut feeling, because generally I don't touch Clojure if performance matters and I don't touch C# if it doesn't.

By the way, I've implemented persistent data structures on disk for a problem they were particularly useful for. If stalling for a cache miss feels bad, try waiting for an SSD :)


Having to use 24 cores to get to 1/4th of the performance of a single threaded C# application seems particularly awful.

Especially as C# is a relatively pleasant language to use.

It doesn't matter how good threading in Clojure is if another language beats Clojure without even needing it (also: great scaling is pointless if you start so far behind).


Yes, hammering in a nail with a screwdriver is particularly awful. I love using Clojure where it's appropriate, but I'm not in the least bit tempted to use it for game dev.


Could you share an example program that does that?


Take that with a grain of salt; from experiments run by myself and others, I can share that this does not scale linearly.

Hope some performance experts can chime in but this might be related to data structure sharing between caches or allocation contention in the JVM.


If you've got some code, I'm happy to take a look.



I don't have access to the Zulip chat, but the other benchmarks are basically testing allocating in a hot loop. I'm not surprised that doesn't scale linearly, and it's certainly not representative of real world code I've ever written.

If you have code you wrote to achieve something and hit a wall with scaling, I'm happy to take a look.


Up to 4 threads I usually see linear scaling as well, but it begins to drop off afterwards, although I don't have a representative example at hand.

I'd like to see a good example if you have one available, most of my performance work was done in a single-threaded context until now


I don't really have anything off hand.

But code primarily slowed down by pointer chasing down big maps, which is MillenialMan's complaint and fits my own experience, will absolutely be sped up linearly.

A bunch of cores sitting around waiting for the next HAMT node to come in will not interfere with each other in the slightest.


This is entirely hypothesising but I don’t see how these wide trees are awful for the cache. In particular I don’t think they would be much worse than a flat array of objects—the problem is, I think, that objects are references and iterating the things in an array of pointers usually sucks.

For a HAMT with depth (e.g.) 2, you should be able to prefetch the next node in your iteration while you iterate the current node of 64 elements. Actually iterating through the objects is going to suck, but hopefully you can prefetch enough of them too. (Or maybe HotSpot could inline things enough to improve memory access patterns a bit to take advantage of ILP.) There's still the problem that you can't read main memory sequentially, except that often all the objects in a HAMT were allocated sequentially, so you can read main memory sequentially after all (as allocation is typically just bumping a pointer, allocation-time locality tends to correspond to address locality).


If I understand you correctly, this is a general problem of functional data structures?

> Clojure will always be slow, because it's not cache friendly.

You always have the option to use the java data structures, for the cases this kind of optimization is needed.


Yes, this is a general problem with functional data structures. They have to be fragmented in order to share data. There's also the more nebulous issue that they encourage nesting to leverage the architectural benefits of immutability, which is a complete disaster for cache friendliness.

Replacing the critical path is an option, but that only works for numpy-style situations where you can cleanly isolate the computational work and detach it from the plumbing. If your app is slow because the inefficiencies have added up across the entire codebase (more common, I would argue), that's not an easy option.


You seem to be missing the point lukashrb made regarding using Java data structures. Your claim was "It's inherent. Clojure will always be slow", which is demonstrably false, as you can use interop with your host language (Java, JavaScript, Erlang) to use data structures that don't suffer from this. Wrap how you're using them in Clojure functions and the usage code for this is idiomatic while also having cache-friendly data structures.
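
A minimal sketch of that kind of wrapping (names are made up):

    ;; the hot path uses a mutable java.util.ArrayList internally,
    ;; callers only ever see plain Clojure functions and values
    (defn make-buffer ^java.util.ArrayList []
      (java.util.ArrayList.))

    (defn push! [^java.util.ArrayList buf x]
      (.add buf x)
      buf)

    (defn buffer->vec [^java.util.ArrayList buf]
      (vec buf))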


I understand the point, I'm not arguing that you can't do that or speed up your code by doing that.

Python is also slow, but you can still go fast (in some cases) by calling into C libraries like numpy - but the performance is coming from C, not Python. Python is still slow, it's just not an issue because you're only using it for plumbing.

But Clojure is immutable-by-default, that's the point of the language - it gives you various guarantees so you don't have to worry about mutability bubbling up from lower levels. In order to wrap your heavy path you have to go outside the Clojure structure and sacrifice that guarantee. You do lose structural integrity when you do that, even if you attempt to insulate your mutable structure. The system loses some provability.


Calling C from Python is very different from calling Java code from Clojure. Clojure always runs on its host (Java/JavaScript), so calling functions via interop usually speeds up everything with zero overhead, compared to calling C code from Python, which does introduce overhead.

Everything in Clojure comes from Java, it's unfair to compare it to "performance is coming from C, not Python" as Clojure works differently.


That wasn't really my point. There is still a translation process you have to apply to cross the boundary between immutable and native data structures though, and that has its own overhead.


I have learned to embrace polyglot programming instead of trying to use the same language for every possible scenario.

Personally I see this "language X is better than Y at Z" stuff as a waste of time, because I would just use X and Y depending on what I have to do.


In general I would agree, but a significant part of Clojure's appeal is that it's immutable by default, because that allows you to make certain assumptions about the codebase. Introducing mutable datastructures means you can no longer make those assumptions, so it potentially has much wider ramifications than e.g. calling into C code from Python.


> If your app is slow because the inefficiencies have added up across the entire codebase (more common, I would argue), that's not an easy option.

This is where I would have to disagree, in my experience, that is less common. Generally there are specific places that are hot spots, and you can just optimize those.

Could be it depends what application you are writing, I tend to write backend services and web apps, for those I've not really seen the "inefficiencies have added up", generally if you profile you'll find a few places that are your offenders.

"Slow" is also very relative.

    (cr/quick-bench (reduce (constantly nil) times-vector))
    Execution time mean : 3.985765 ms

    (cr/quick-bench (dotimes [i (.size times-arraylist)]
                      (.get times-arraylist i)))
    Execution time mean : 775.562574 µs

    (cr/quick-bench (dotimes [i (alength times-array)]
                      (aget times-array i)))
    Execution time mean : 590.941280 µs
Yes, iterating over a persistent vector of Integers is slower compared to an ArrayList or an array, about 5 to 8 times slower. But for a vector of size 1000000 it only takes 4ms to do so.

In Python:

    > python3 -m timeit "for i in range(1000000): None"
    20 loops, best of 5: 12.1 msec per loop
It takes 12ms, for example.

So I would say for most uses, persistent vectors serve as a great default mix of speed and correctness.


A traditional hash table can also be pretty cache unfriendly. I wonder if there are any published measurements that compare these.


Partially unrolled node based data structures could help. Does clojure use them?


This does underline Clojure's weaknesses around performance. For a typical language, the advice is (1) don't call expensive functions and (2) loop less.

For Clojure, we get the wonderful quote "There is a close transducer, partition-all, which has no step arity.".

One of Clojure's biggest weaknesses in practice is that breaking in to those functional structures to figure out where the time is being spent or to debug them is harder than in other languages. This is a natural trade-off of developing a terse and powerful language.

It is notable how much trouble Clojure has had over the years linking a bug back to a line number and a reason. Even with spec.


>>> One of Clojure's biggest weaknesses in practice is that breaking in to those functional structures to figure out where the time is being spent or to debug them is harder than in other languages. This is a natural trade-off of developing a terse and powerful language.

Not that hard if you use something like YourKit. There's also a quite good Clojure library https://github.com/hugoduncan/criterium .


The performance difference between (peek n) and (last n) for vector typed n strikes me as something that really ought to be handled by the implementation, not the programmer since the outputs are equivalent.
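
A quick sketch of the gap (criterium as cr, as used elsewhere in this thread):

    (def v (vec (range 1000000)))

    (cr/quick-bench (peek v))  ;; reads the vector's tail directly, O(1)
    (cr/quick-bench (last v))  ;; walks the whole collection as a seq, O(n)
    ;; both return 999999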

I really like clojure, like, a lot, but I might have taken an opposite than intended result from this blog post. It made me feel like the language has a number of performance footguns that don't necessarily need to exist.


It seems like an intentional design choice. didibus sorta explained it to me: https://news.ycombinator.com/item?id=28406315

It does feel like a tradeoff that was a bit too harsh. As the post shows, you can work around it... but at the end of the day my code ends up with a ton of (nth 0) and very few (first) calls. It feels a bit ugly.. and it doesn't feel "idiomatic" as in how you learn Clojure from a textbook. Some more discussion of it here: https://clojureverse.org/t/what-do-beginners-struggle-with/5...

I find the threaded stuff much easier to follow. Maybe it's mostly visual and familiarity

And as an interesting extra: https://clojureverse.org/t/x-x-auto-transducifying-thread-ma...


I agree. Also, avoiding Clojure’s collections entirely is a great way to get better performance, but it’s hard to argue that is idiomatic. This applies doubly to ClojureScript.


I personally feel that people for whom all the underlying JVM/Java stuff is deeply icky and unidiomatic are missing out on a lot. Clojure has some deeply held principles but also a whole heap of pragmatism and that’s what’s always kept it useful to me.


That is one of the reasons I dislike guest languages: the communities tend to have a culture of rebuilding the underlying stack, as if not appreciating that it is what allows their favourite language to exist in the first place.


What is a "guest language", haven't heard this term used outside of GraalVM before?

From context, it's a language that uses the platform of a host?

IE, JVM languages like Kotlin/Scala/Clojure when Java was the initially intended one? Or CLR languages like VB/F# which are second-class in comparison to C#?

If it is this, the one interesting thing I have to note is that I've found that sometimes Kotlin/Scala will produce better bytecode and performance than what I would have written naively by hand.

Just due to them emitting horror-inducing code/optimizations that are great for performance but never meant for human eyes.


Yes, the languages that are foreign to the platform and require additional tooling to be usable, or have to generate code to fake host semantics.

JVM and CLR aren't the same thing in regards to guest languages.

JVM was designed for Java, and everyone has to fake being Java when generating bytecode.

CLR was designed, just like WebAssembly, to be a language-agnostic runtime, initially released with VB, C#, J#, C++ and a couple of .NET versions of Oberon, Eiffel, Lisp, Scheme, COBOL.

Guest languages then tend to pick their own abstractions, which work great in year zero, and then get problematic as the platform decides to go in another direction, or the guest language decides it wants to break free and target multiple hosts.

Kotlin is a great example: due to its marriage with Android, its Java compatibility is capped to the Android Java subset, or one needs to make use of KMM.

Additionally, Java developers cannot easily consume Kotlin APIs based on coroutines, and it remains to be seen how they will integrate them with Loom.

Or the reboot they were forced to do in Kotlin/Native because it wasn't originally designed to be compatible with JVM semantics.


Ah, I see what you're saying now.

  > Or the reboot they were forced to do in Kotlin/Native because it wasn't originally designed to be compatible with JVM semantics. 
sigh

Yeah. BUT! I am very bullish on Kotlin. Think it's probably the most exciting language evolving right now.

I went on a few-tweet minirant here about why:

https://twitter.com/GavinRayDev/status/1443279425311805440

But the tl;dr is that:

  - There is Jetpack Compose currently, for Desktop, Web, and Android

  - And Kotlin Native putting a large portion of resources into Skia bindings (JetBrains calls the lib "Skiko" https://github.com/JetBrains/skiko)
It's very clear (and there are some employees who have confirmed this IIRC) that they are working on "Jetpack Compose Everywhere" that runs on iOS as well, from a single codebase. Skiko Kotlin Native lets you use Skia to render on the web as well, if you compile to the WASM target.

There's the big Kotlin event going on right now, where they just announced the new WASM backend and changes in their compiler + IR commonizing/restructuring ("K2").

- https://blog.jetbrains.com/kotlin/2021/10/the-road-to-the-k2...

- https://www.youtube.com/watch?v=-pqz9sKXatw

The net result is that you wind up with a single language that you can use to write your backend API, your UI code (Jetpack Compose app deployed across Web/Android/iOS/Mac/Win/Linux, or transpile to JS/TS if you just want a web app, etc) and with Kotlin Native even your native, low-level code to integrate with existing C/C++ etc ecosystem.

KN already does automatic bindgen for C and Swift headers, they have direct C++ interop (like Swift does) on their future roadmap too:

(Ctrl+F "interoperability")

https://kotlinlang.org/docs/roadmap.html#roadmap-details

All of this is mostly possible already -- I can do the same thing using IE Java, GraalVM, and a transpiler like Google's j2cl or bck2brwsr (which is what Gluon uses for JavaFX on the web). Including the "native" part.

IE, here's a contribution I made to get GraalVM producing native binaries using Skia from the JVM + JNI Jetbrains Skia library (and even invoking my Java app methods as a .dll from C++)

https://github.com/HumbleUI/JWM/issues/158

But Kotlin is pushing the hardest to make this whole platform/stack from native <-> desktop <-> mobile <-> browser a seamless, unified experience.

And you can feel it, when you try to do the "whole stack, every platform, one language" thing.

Sorry for the rant and wall of text!


Or one just uses JVM implementations instead, and yes they exist for all Kotlin target flavours, including iOS.

JetBrains wants to go big with Kotlin and be a language vendor; let's see how they manage outside Android, where they aren't the godfather-chosen language.


I've only used Kotlin for serverside development (Redhat's Quarkus framework) and experimentally Kotlin Native, though it definitely still has the reputation as "That Android language" for sure


Clojure explicitly embraces the host, and it is also one of the first things people criticize, haha.

    Language built for platform vs language ported-to platform
    - Many new languages still take 'Language as platform' approach
    - When ported, have platform-on-platform issues
    - Memory management, type-system, threading issues
    - Library duplication
    - If original language based on C, some extension libraries written in C don’t come over
There are a lot of times where the core team says: Java already does this well, just use the one from Java.


i like in depth optimisation posts like this post. however it starts by expressing disagreement with the statement that idiomatic common lisp is much faster than idiomatic clojure, but in the end doesn't do much to dispel this. all it does is show how much in-depth knowledge of clojure and the jvm you need in order to still be slower than a very simple implementation in common lisp. i myself am quite new to common lisp, and coming from a machine learning and scientific computing background, i am VERY impressed by its speed, interactiveness, and just general neatness of programming. i don't think there exists a language that comes close to this. i wish there were more members from the ml and scientific community involved in the common lisp community. in fact there is one person in the clojure community who i would love to see try common lisp: https://dragan.rocks/


That's a fair criticism, but indulge this hypothetical, if you will:

- assume the sliding window transducer existed in the standard library (there's an open jira for it)

- assume the programmer is not an expert, but is familiar with stuff like boxed maths in Clojure, which isn't hard to work around, well documented [1], and has a known performance impact

- assume the programmer will avoid using `last`

- assume the programmer is comfortable using transducers

In short, that the transducer exists and the programmer has at least 6 months of experience programming in Clojure full time.

Would you consider the resulting implementation complicated, in-depth, or non-idiomatic? There's always the possibility I'm just used to it so it comes naturally to me, but as the title suggests, I consider the result, before dropping down to arrays, way more idiomatic than the initial implementation.

I don't begrudge CL its speed and implementation; what gets me is the readiness with which Clojure is dismissed as being slow by people with only surface familiarity with it.

[1] - https://insideclojure.org/2014/12/15/warn-on-boxed/


While I write almost all my code in Nim nowadays, lisps have a special place in my heart despite not having written large amounts of lisp code.

I'm a physicist and when checking out CL in the past, it didn't feel like scientific computing was a huge thing in CL. You intrigue me with your post though. Do you have any resources for me to check out?


my background is also in physics and mathematics. in fact i think common lisp is well suited to a physicist-hacker type mindset. i could probably write a lengthy blog post about why i think this could be, but some of the key appeal factors are:

* unmatched interactive development which could be used for exploration of problems such as dynamical and complex systems

* although it is a very high level language you can perform low level optimizations in common lisp itself

* extremely powerful macro system that allows you to develop in a declarative style to reflect your problem domain. for example you can program in a syntax that is natural to general or special relativity, quantum mechanics, etc.

* fast and free implementations such as SBCL

* the language is standardized so your code will not need to be unnecessarily maintained (this is good for university professors that have repeating long-running courses)

i am not sure what type of references you are asking for but i will list few open source projects that might be interesting to you:

* CLASP - common lisp implementation built on top of LLVM. this project is headed by a computational chemist (prof. schafmeister) [0]

* PetaLisp - for HPC [1]

* MAGICL - Matrix Algebra proGrams In Common Lisp by Rigetti Computing [2]

* Quil - quantum instruction language developed on top of common lisp [3]

* CL-CUDA - use NVIDIA CUDA in Common Lisp programs [4]

* MGL - Common Lisp machine learning library [5]

* Maxima - a computer algebra system developed in common lisp [6]

i think these are plenty to show that there exists serious interest in using common lisp for scientific computing. it is possible that some of the people behind these projects are on HN and maybe they can further expand on this

[0] https://github.com/clasp-developers/clasp

[1] https://github.com/marcoheisig/Petalisp

[2] https://github.com/quil-lang/magicl

[3] https://github.com/quil-lang/quilc

[4] https://github.com/takagi/cl-cuda

[5] https://github.com/melisgl/mgl also https://github.com/melisgl/higgsml

[6] https://maxima.sourceforge.io/


Don't forget Lush: Lisp Universal SHell[1] by none other than Yann LeCun, and this was back in 2004. I played with it in 2012 or so. Lisp, SBCL, is great for science. I am playing with April, a subset of APL in Lisp. The best of both worlds!

[1] http://lush.sourceforge.net/

[2] https://github.com/phantomics/april


I am someone who uses Clojure a lot and understands transducers, but I would pretty much never consider transducers "idiomatic code".

In my opinion they are a straight-up optimization, and using them definitely reduces code readability for the vast majority of people who would ever read your code (yes, it makes your code less readable even if it sometimes doesn't increase the token count of your code)


I basically agree with the sentiment that, even with the sliding transducer not being in a library, the code he ends up with is far better (using keep w/ the checked - is probably the clearest version). With tests you can get convinced the transducer basically works even if you don't understand it, and it _is_ the idiomatic way of writing a transducer, even if writing them at all is unusual.


Clojure's power is not in its ability to be fast.

Immutability has a price and it's a price we're consciously paying. JVM also has a performance cost, but it's well worth paying due to the wins of running on a virtual machine.

For most boiler-plate tasks, especially where I/O is involved, Clojure is fast enough (eg. handling HTTP requests or talking to a database).

The true power of Clojure, in my opinion, is in how it allows you to think about problems; that machines produce side effects by interpreting a data structure, that the program is one transformation of the input data structure to the target data structure for the desired side effect. You don't think in terms of imaginary objects 'doing' things, you think about the input data structure and what the output data structure should look like, such that if fed to a library function, the desired side effect will occur (eg. the text will be printed or the robot arm will rise).
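
As a toy sketch of that mode of thinking (names are made up):

    ;; describe the desired effect purely as a data transformation...
    (defn order->shipping-label [order]
      {:to     (:ship-to order)
       :items  (mapv :sku (:items order))
       :weight (reduce + (map :kg (:items order)))})

    ;; ...and only at the edge hand the result to something that performs
    ;; the side effect:
    ;; (print-label! (order->shipping-label order))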

This mode of thinking about problems is closer to how machines actually work on a higher level and amusingly, 'close to the metal' C++ offers abstractions which take the writer into a fairy tale of class hierarchies, methods that 'do' things to other things, inserting, deleting, mutating everything all over the place. Because that's efficient and fast (and dangerous). But it's also focused a lot on telling the machine 'how' to run; while coincidentally making it do 'what' you want it to do. As a wizard who can put that orchestra with swords and knives in motion without too much blood spilling on the floors, you can't but feel proud to have achieved it.

Data in -> data out. That's Clojure's way of thinking.

It's a lot simpler way of looking at the problem and maybe makes the `->` feel a bit insignificant and maybe not that important in most cases, as long as it produces the correct output.

The biggest performance hit will be choosing the wrong data structure for solving the problem in a functional way. If that's done correctly, Clojure does offer some tools for squeezing a bit more out of performance critical pieces of code (the articles covers some). Those should be a couple of functions in all your code base; for the boilerplate and domain logic or code involving non realtime I/O, it doesn't matter that much.

If performance is a big concern and you end up writing imperative code in Clojure, I think you'll be better off writing the bits in Java or choose a different language, which is better designed for that purpose.


Pretty sure that this isn't the post's purpose, but it makes me want to never touch Clojure. I'm obviously not used to it, but I can't help thinking that it takes serious mental gymnastics to consider the "optimized" code the author came up with as even remotely good from a readability / comprehensibility / extensibility perspective.


    (def array-xf
      (comp
       (sliding-array 8 1)
       (keep (fn [^objects arr]
               (when (> 1000 (unchecked-subtract
                              (long (aget arr 7))
                              (long (aget arr 0))))
                 arr)))))
This is a standard Clojure feature called a transducer pipeline. It describes a pipeline of transformations to apply to a source:

(sliding-array 8 1)

This says to iterate 8 elements at a time in a sliding-window manner, so you grab elements 0 to 7, then 1 to 8, then 2 to 9, etc., and each window will be an array.

(keep fn)

This says that for each of those sliding windows, keep only the non-nil values returned by fn.

(fn [^objects arr] (when (> 1000 (unchecked-subtract (long (aget arr 7)) (long (aget arr 0)))) arr))

Finally, this is the function used by keep. It receives as input an array for each sliding window of 8 elements, and it says: when the value of the last element in it minus the first element in it is less than 1000, return the sliding window, otherwise return nil.

In effect, the whole thing will thus return only the sliding windows for which the value of the last minus the first element in the window is less than 1000.

(sequence array-xf times-v)

array-xf only describes the computation; it is generic over the source, so you need to apply it to a source, and that's what this code does. It says to apply array-xf to times-v and give you a sequence of the result.
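
For completeness, the same transducer can also be applied eagerly:

    ;; realize the result into a vector instead of a lazy sequence
    (into [] array-xf times-v)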


Are you talking about the lispness/parenthesis of the code or is there something else that makes the code inherently difficult for you to comprehend?


For the record, I'm the author of the original post on optimizing Common Lisp that the OP is addressing. The OP is a wonderful deep dive into optimizing Clojure. However, aside possibly from avoiding the "last" function, and maybe using "keep" rather than "map" + "filter", I personally would avoid some of the strategies he outlines unless I truly needed the performance.

Transducers, for example, have a couple of counter-intuitive features that make them harder (for me, at least) to read... the order of "comp" arguments, for example, is reversed from normal usage. We tend to avoid them at my work unless they are truly needed for performance. In a complicated business domain, we struggle much more with building understanding than with slow code, so readability matters quite a bit.
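
For anyone unfamiliar, a tiny sketch of that reversal:

    ;; ordinary function composition: the rightmost function runs first
    ((comp #(* 2 %) inc) 3)                        ;; => 8  (inc, then *2)

    ;; transducer composition: the leftmost transformation is applied first
    (into [] (comp (map #(* 2 %)) (map inc)) [3])  ;; => [7] (*2, then inc)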

(I leave aside the question of what is "idiomatic," since it seems pretty subjective, and probably depends somewhat on the crowd you run with.)

As several people have written here, Clojure is fast enough without optimization most of the time. (I leave aside the question of startup time -- a very different discussion.) Though I'm curious how to get good performance, most of the time it matters more to me for the code to be easier to understand than for it to run optimally fast, unless the code creates a bottleneck that has some clear impact on usability.

My goal in writing the original article was to play with optimizing Common Lisp as a learning exercise, rather than to throw down the gauntlet and assert that "Clojure is slow." The author's response was to show how to optimize the Clojure, which was very interesting! Ultimately, both languages have their strengths and weaknesses, and I consider myself lucky to be able to write Clojure for pay.


Thanks, I do think your original article just got people curious for the challenge, all good fun in my opinion.


So in the end, a vanilla CL implementation is still slower than all the work that went into the Clojure version? I love SBCL, and I am rediscovering Lisp after playing with a lot of other PLs. This is reassuring to me in my choice to go with SBCL and CL.


On Slack I gave it a go as well, my naïve idiomatic solution turned out to be 50x faster from the get go.

Times are all from my laptop using the same test input and running on JDK 11.

Baseline smt-8: Execution time mean : 2.386019 sec

My first attempt: Execution time mean : 48.977240 ms

    (defn smt-8'' [times-vec]
      (loop [res (transient []) pointer-1 0 pointer-2 7]
        (if-let [end-element (get times-vec pointer-2)]
          (let [start-element (get times-vec pointer-1)
                time-diff (- end-element start-element)]
            (recur
             (if (< time-diff 1000)
               (conj! res [(subvec times-vec pointer-1 (inc pointer-2))
                           time-diff])
               res)
             (inc pointer-1)
             (inc pointer-2)))
          (persistent! res))))
With performance tweaks: Execution time mean : 23.567174 ms

    (defn smt-8''' [times-vec]
      (binding [*unchecked-math* true]
        (loop [res (transient []) pointer-1 (int 0) pointer-2 (int 7)]
          (if-let [end-element (get times-vec pointer-2)]
            (let [start-element (get times-vec pointer-1)
                  time-diff (- end-element start-element)]
              (recur
               (if (< time-diff 1000)
                 (conj! res [(subvec times-vec pointer-1 (inc pointer-2))
                             time-diff])
                 res)
               (inc pointer-1)
               (inc pointer-2)))
            (persistent! res)))))
Less idiomatic as I'm using an array: Execution time mean : 1.678226 ms

    (defn smt-8'' [^"[J" times-arr]
      (binding [*unchecked-math* true]
        (loop [res (transient []) pointer-1 (int 0) pointer-2 (int 7)]
          (if (< pointer-2 (alength times-arr))
            (let [start-element (aget times-arr pointer-1)
                  end-element (aget times-arr pointer-2)
                  time-diff (- end-element start-element)]
              (recur
               (if (< time-diff 1000)
                 (conj! res [(mapv #(aget times-arr (+ pointer-1 %)) (range 8))
                             time-diff])
                 res)
               (inc pointer-1)
               (inc pointer-2)))
            (persistent! res)))))
All my solutions produce exactly the same result where it includes both the list of elements and their time difference.


Correction, you have to do:

(set! *unchecked-math* true)

globally, wrapping in a binding like I did does nothing, since it needs to be set at compile time. My measurements were with it set globally.



