Is Julia Really Fast? (medium.com/codex)
145 points by leephillips 3 months ago | 49 comments

I'm extremely happy with Julia performance. I'm using it right now to prototype some new algorithms for my PhD.

How fast is fast? Well, in my case I'm working on collision detection, and I'm getting <30 ns for SAT tests between OBBs (Oriented Bounding Boxes) and <300 ns for minimum-distance computation using the GJK algorithm, again between OBBs.

In the context of recommender systems I did an experiment when I started learning Julia comparing against Cython and wrote about it, although I was a newbie and I should update it: https://plopezadeva.com/julia-first-impressions.html

So, the article does a bit of goalpost switching. OP provides an example of a massive speedup for a real application, which I don't doubt, but constantly compares Python (which is really only fast when you dispatch to optimized C/C++) against C.

"Unless you are a C/C++ wizard normal Python developers cannot fix or optimize big libraries like NumPy or TensorFlow. In Julia its users tend to contribute more to the libraries they use themselves."

I am excited by Julia. It seems to have a good niche and community around it. It seems well suited for its job. That focus will probably make it better than more general alternatives in short order.

Well, further, Python's performance problems tend to come from its C/C++ interop. Python made a few mistakes early on that make JITing it really hard without backwards-incompatible concessions.

Easy/Fast C interop is a blessing and a curse. Yes, you can do fast things from C with little overhead, but it also makes things like having a generational garbage collector or a JIT really hard.

I really don’t think that’s the problem with Python. The problem is that the actual semantics of the language make it so that if you did have a compiler, very very little could be assumed about a piece of code at compile time.

Classes in Python are basically sparkling Dicts and their memory layout can change at any time, so basically everything must be boxed. Anytime you write a simple loop, the runtime has to ask “okay, what am I looping over?”, “what sort of object is this function inside the loop acting on?”, “how big is this object?”, “where does it live in memory?”, etc.

This is deeply baked into the semantics of the language, which is why only specific subsets of the Python language can actually be made blazing fast while retaining compatibility with Python semantics.
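To make the boxing cost concrete, here is a toy Julia sketch of my own (not from the article; `mysum` is a hypothetical name): the same loop over concretely typed storage compiles to a tight unboxed loop, while a `Vector{Any}` forces the boxed, check-everything path described above.

```julia
# Toy sketch: the cost of boxed storage, visible even from Julia.
function mysum(xs)
    s = 0.0
    for x in xs
        s += x       # compiles to a bare float add only if eltype(xs) is concrete
    end
    return s
end

concrete = collect(1.0:1000.0)        # Vector{Float64}: unboxed, contiguous storage
boxed    = Any[x for x in concrete]   # Vector{Any}: every element is a boxed pointer

mysum(concrete); mysum(boxed)         # warm up (trigger JIT compilation)

# After warm-up, the concrete version allocates nothing per call, while the
# boxed version has to box intermediates on every iteration:
@allocated mysum(concrete)            # 0 bytes
@allocated mysum(boxed)               # many KB of boxes
```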

> Classes in Python are basically sparkling Dicts and their memory layout can change at any time so basically everything must be boxed.

My understanding was that most JavaScript JITs solved this problem well using shapes.

Python's classes can change dynamically at runtime, though, so each iteration of a loop would need to check whether the shape has changed. I'm no Python expert though, so please correct me if I'm mistaken.

But it's more than that. There's a really nice talk here from the creator of the Flask framework about how a bunch of different semantic choices in the language design leak out into the ecosystem and make many optimizations impossible: https://www.youtube.com/watch?v=qCGofLIzX6g

Taking a look at that, I think shapes would still work; [1] has some background material on them. Dynamically changing data is exactly what shapes are for: you check whether you should update an object's shape when assigning to it, and you can JIT based on the type (you ought to be able to handle getattr and setattr with this too, JITting to a direct call to that method when necessary). Where this completely falls down in optimizing Python is that it changes the object layout, so you can't pass these objects to CPython extensions without making them generic and converting them back afterwards. Converting the layout to be generic would be a deep copy (which of course is semantics-breaking), and converting it back wouldn't even be possible if the C code kept a reference to it around.

[1]: https://mathiasbynens.be/notes/shapes-ics


What Javascript JITs realized is that even though objects are malleable, they aren't often changed in most code. As a result, JITs make optimizations that assume that the object shape does not change (with checks to deoptimize when that assumption is violated).

But as you point out, this is all thrown out the window because the Python C interface effectively exposes the memory layout of Python objects directly to C extensions. The nice thing about that is that calling C is super fast and C has a lot of power: it can create, destroy, and modify Python objects at will with little overhead.

This becomes problematic for garbage collection because you can't move objects around if C holds a direct reference to their memory, and moving/compacting is where most GC performance benefits come from. It effectively forces Python to use reference counting.
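For comparison, Julia's C interop makes this tension explicit: a raw pointer is invisible to the GC, so the programmer marks the region in which native code may touch an object's memory. A minimal sketch using only Base APIs (`buf` is just an illustrative buffer):

```julia
# A raw Ptr is not a GC root, so the object must be explicitly kept alive
# for as long as native code might use the pointer.
buf = zeros(UInt8, 4)

GC.@preserve buf begin           # buf is guaranteed not to be collected here
    p = pointer(buf)             # raw Ptr{UInt8}, as you would pass to a ccall
    unsafe_store!(p, 0x2a, 1)    # write through the pointer, just as C would
end

buf[1]    # the write through the raw pointer is visible: 0x2a
```

The explicit contract means the runtime never has to guess which objects native code can still see, instead of pessimistically pinning everything the way CPython's C API forces it to.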

Java also has C interfaces, but they are MUCH more punishing to use and come with huge performance downsides. Even though they are working to decrease those costs, they won't reach the low overhead of Python's C interop. That's because handing out a direct reference to an object managed by the JVM's GC is impossible; there is some level of memory copying that simply has to happen.

PyPy is what you get when you relax some of Python's compatibility constraints. It gets pretty significant performance boosts over the default implementation, but sacrifices the C interface.

The author gives an example of a little function written in a generic way, and shows how JIT compilation specializes it at runtime into highly optimized machine code. It’s a wonderful example of how Julia’s JIT compiler works, and how it can lead to significant speedups over even C or Fortran code in some cases.
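The mechanism can be sketched in a few lines (a toy function of my own, not the article's example): one untyped definition, and the compiler emits a separate specialized method body for each concrete argument type it encounters.

```julia
# One generic definition; no type annotations anywhere.
double(x) = x + x

double(3)            # triggers compilation of a machine-integer version
double(1.5)          # a separate, fully unboxed floating-point version
double([1 2; 3 4])   # and a matrix version, reusing the generic + method

# You can inspect the specialized machine code for any signature, e.g.:
#   @code_llvm double(3)     # essentially a single integer add, no boxing
```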

At the Fortran Wiki http://fortranwiki.org/fortran/show/Articles there are links to benchmarks comparing Fortran to Julia and other languages.

Looking at https://github.com/arturofburgos/Assessment-of-Programming-L... https://www.matecdev.com/posts/numpy-julia-fortran.html https://github.com/zyth0s/bench_density_gradient_wfn https://github.com/PIK-ICoNe/NetworkDynamicsBenchmarks

people do find Julia to be faster than Python/Numpy, but it is not uniformly faster than Fortran. And Julia's start-up time should not be ignored. Quoting the last link, "In fact the whole Fortran benchmark (300 integrations) finishes roughly in the time it takes to startup a Julia session and import all required libraries (Julia 1.5.1)."

I don’t think anyone is claiming that Julia is uniformly faster than well-written Fortran. But you can get comparable performance with code using a style that many people find easier to read, write, and reason about; and it’s interesting that sometimes this code does outperform optimized C or Fortran code.

Startup time is much improved in recent Julia versions, but is certainly not negligible for short calculations.


Yes, and the mentioned speedup (from TFA) was likely at least partially due to the accessibility of Julia, which makes some ancillary optimizations easier or almost accidental. You don't need to beat Fortran "at Fortranning" (the innermost column-oriented processing) to be faster overall. I'm speculating without evidence, though.

While it certainly won’t always win, there are many documented cases where Julia beats Fortran at Fortranning, e.g.



Comparing different metrics is not valid across all use cases. Counting startup time but excluding development time is disingenuous. Having used both Fortran and Julia, I find the difference in development time staggering.

Different tools for different use cases is the best way to put it.

If you're actually doing HPC, say, on a cluster, the startup time will be amortized; otherwise, you're doing something wrong.

Why? I commonly do things like screening 10,000 molecules for some property by calling a Fortran program 10k times. If each run takes 60 s of compute, the 10 s boot for Julia (over the NFS disk) would be pretty significant.

Using Julia in this case basically means rewriting all of the other programs in it and giving up e.g. Snakemake or GNU Parallel or any number of other very common HPC workflows.

In fact I'd venture that running a sequence of things many times is just as common an HPC workload as a single job running for a very long time.

One reason to consider Julia for workflows like this is that it's been designed to improve reproducibility. Julia makes it really easy to distribute package state (through Manifest.toml files), which makes it easy to get exact copies of the package versions the original author used.
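Concretely, the workflow looks like this (standard Pkg API; the temporary directory is just a stand-in for a project you cloned):

```julia
using Pkg

dir = mktempdir()        # pretend this is a cloned project directory
Pkg.activate(dir)        # use this directory's Project.toml / Manifest.toml

# In a real project that ships a Manifest.toml, one call fetches the exact
# package versions (down to the git tree hash) the original author recorded:
# Pkg.instantiate()

Base.active_project()    # now points at <dir>/Project.toml
```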

Guix/Nix does this better across many programming languages. Unlike language-specific package managers, Guix is fully recursively reproducible ("bootstrappable"). And Nix isn't terribly far behind, either.

Julia itself is not even reproducible: https://github.com/JuliaLang/julia/issues/25900 https://github.com/JuliaLang/julia/issues/34753

Note that those issues are about deterministic performance, not results. Also, Guix and Nix appear to be Unix-only, which severely limits their use for Windows users. One of the places Julia really excels is cross-OS reproducibility; yes, there are probably some bugs, but the ability to manage versions this easily across OSes is something I'm not aware of a good alternative for.

> Note that those issues are about deterministic performance, not results.

Not at all. The issue is that the default sysimage distributed with Julia cannot be reproduced across machines, e.g. there is no proof that it hasn't been tampered with. This is an open issue in Guix and Nix, though it affects all platforms.

> One of the places Julia really excels with reproducibility is inter OS

There really is no such thing as reproducibility on Windows, at least not in the sense outlined at bootstrappable.org. If you care about reproducibility, you've chosen the wrong platform.

Sorry for the confusion. I meant reproducibility in the sense that I can have a git repo to share code between a laptop running Windows and a cluster running Linux, and if I run the program on both, I'll be running the same program (same libraries, same versions, equivalent binary blobs, etc.). I don't mean that it can give you cryptographic proof that your code hasn't been tampered with maliciously, just that you can debug your code locally, then deploy it wherever you want and be pretty sure it will still work, even on a different OS and/or CPU architecture.

Is there any particular reason you wouldn't want to wrap that workload in a function/module/... and call it in a loop? Is the problem that your commonly used tools are exposed only as command line interfaces and not as libraries?

Yeah, that's what I mean. This is the case for both commercial software (e.g. Q-Chem) and free software.

This is not to say that the boot time is always a problem; there are plenty of applications in which it doesn't matter at all (as you point out). But that's not always the case (by a long shot), which is why tools like GNU Parallel and Snakemake exist.

Just keep it running as a daemon? https://github.com/dmolina/DaemonMode.jl

Of course it will not always beat Fortran, but don't you think it is damn impressive that a high-level dynamic language with much higher productivity frequently matches or beats Fortran? That is a crazy achievement if you ask me.

Real world systems today are going to use a lot of computing resources, such as clusters, GPUs, tensor processing units, multiple cores etc. In such a world, anything that makes that easy to deal with is going to have the performance edge in practice.

Doesn't matter how fast a Fortran program would be in theory, if the Julia program is delivered years ahead of it.

Thanks for this. Not sure why it's downvoted. The numbers are worth looking at.

If Julia could become a native vector/GPU programming language that is usable by mere mortals, this niche might eventually grow into the mainstream. But I can't help but notice that e.g. Nvidia's GitHub mentions only: Python, C++, C, Go, CUDA (not sure the order matters).

Julia has excellent CUDA support[1], I have no idea why nvidia doesn't promote it more. It's fast, flexible and very featureful. There was a recent thread about it here: https://news.ycombinator.com/item?id=27496679

The AMD [2] and Intel [3] support is younger, but developing quickly.

There's also [4], a unified API that works across different GPU vendors to avoid lock-in.

[1] https://cuda.juliagpu.org/dev/

[2] https://github.com/JuliaGPU/AMDGPU.jl

[3] https://github.com/JuliaGPU/oneAPI.jl

[4] https://github.com/JuliaGPU/KernelAbstractions.jl

Julia is faster than python for what I've done so far, so yes. I actually prefer Elixir but I use both.

Well, Common Lisp or Racket are a lot faster than Python, too, and both are general-purpose languages, which has many advantages over a language specific to numerical computing.

Like what advantages? Julia is a "general purpose language" too, every bit as much as any of those other languages. It all boils down to the availability of polished utility libraries, and I don't exactly think Common Lisp or Racket have an indisputable edge there.

I don't get this argument. Maybe you could elaborate. I tried for a long time to get used to Racket; I really like the idea of LISP. But the reason I started using Julia wasn't that it was good at numerical programming. It was because it was a much better GENERAL PURPOSE programming language than Racket.

Unless you do a lot of meta programming stuff, I don't find Racket to be advantageous to use. Julia syntax is much friendlier towards general-purpose programming. It is easier to do mathematics as well as string manipulation with Julia.

Multiple dispatch also makes it a lot easier to write code. Getting into the LISP heavy thinking of Racket is harder than jumping onboard Julia.

I find that I have learned functional programming, metaprogramming, and the other things that make LISP/Scheme famous more thoroughly in Julia than I ever managed in Racket.
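For readers unfamiliar with it, multiple dispatch looks like this (a toy example of mine, not the commenter's): the method is selected from the runtime types of all arguments, not just the first.

```julia
struct Point
    x::Float64
    y::Float64
end

# Three methods of one function; Julia picks by the types of *both* arguments.
dist(a::Point, b::Point)  = hypot(a.x - b.x, a.y - b.y)
dist(a::Point, r::Number) = abs(hypot(a.x, a.y) - r)   # distance to a circle of radius r
dist(r::Number, a::Point) = dist(a, r)                 # symmetry comes for free

dist(Point(0, 0), Point(3, 4))   # Point/Point method: 5.0
dist(Point(3, 4), 5)             # Point/Number method: 0.0
```

Because dispatch is open, anyone can later add `dist(::Point, ::Line)` in their own package without touching the original code.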

If only Julia had fluent static compilation, it would be in the top 10 of languages. It's too bad there isn't more investment in the really important stuff that would benefit the world. All the investment goes to bubbles and financial pyramids.

I hope it'll get there! If anyone wants to help out, check out (e.g.) https://github.com/tshort/StaticCompiler.jl/pull/46

If PackageCompiler.jl was more polished you mean?

One of the gripes that I have with Julia is that if you write linear algebra code naively, you will have tons of unnecessary temporary allocations, while in Eigen (a C++ library) you can avoid most of these without sacrificing too much readability. (It even optimizes how to run matrix kernels on the fly!) Sure, you can rewrite your Julia code in C-style to remove those temporary allocations, but then the code becomes even less readable than what you can achieve in C++.

Here's an example: https://ronanarraes.com/tutorials/julia/my-julia-workflow-re...

The naive Julia version has unnecessary allocations and therefore is 23% slower than the optimized version:

    @inbounds for k = 2:60000
        Pp   .= Fk_1 * Pu * Fk_1' .+ Q
        K    .= Pp * Hk' * pinv(R .+ Hk * Pp * Hk')
        aux1 .= I18 .- K * Hk
        Pu   .= aux1 * Pp * aux1' .+ K * R * K'

        result[k] = tr(Pu)
    end
In order for this loop to match the C++ version you need to use C-style functions:

    for k = 2:60000
        # Pp = Fk_1 * Pu * Fk_1' + Q
        mul!(aux2, mul!(aux1, Fk_1, Pu), Fk_1')
        @. Pp = aux2 + Q

        # K = Pp * Hk' * pinv(R + Hk * Pp * Hk')
        mul!(aux4, Hk, mul!(aux3, Pp, Hk'))
        mul!(K, aux3, pinv(R + aux4))

        # Pu = (I - K * Hk) * Pp * (I - K * Hk)' + K * R * K'
        mul!(aux1, K, Hk)
        @. aux2 = I18 - aux1
        mul!(aux6, mul!(aux5, aux2, Pp), aux2')
        mul!(aux5, mul!(aux3, K, R), K')
        @. Pu = aux6 + aux5

        result[k] = tr(Pu)
    end
... which is quite dirty. But you can write the same thing in C++ like this (and even be a bit faster than Julia!):

    for(int k = 2; k <= 60000; k++) {
        Pp   = Fk_1*Pu*Fk_1.transpose() + Q;
        aux1 = R + Hk*Pp*Hk.transpose();
        pinv = aux1.completeOrthogonalDecomposition().pseudoInverse();
        K    = Pp*Hk.transpose()*pinv;
        aux2 = I18 - K*Hk;
        Pu   = aux2*Pp*aux2.transpose() + K*R*K.transpose();

        result[k-1] = Pu.trace();
    }
which is much more readable than Julia's optimized version.

If Julia had a linear-algebra-aware optimizing compiler (without the sheer madness of C++ template meta-programming that Eigen uses), then Julia's standing in HPC would be much, much better. I admit that it's a hard goal to achieve, since I haven't seen any language trying this (the closest I've seen is LLVM's matrix intrinsics (https://clang.llvm.org/docs/MatrixTypes.html), but it's only a proposal)

You're comparing dynamically resizable Julia arrays against Eigen's static arrays. Julia does have a static array package, StaticArrays.jl.

Oh, you're right. I naively thought that the Julia code was written with static-size arrays.

But still, wouldn't even StaticArrays.jl create some unnecessary allocations and copies when written in the naive (clean) way? For example, when calculating something like z = Ax + By, aren't Ax and By temporarily allocated in Julia's case, while you can avoid this in Eigen?

StaticArrays.jl will create arrays that are stack allocated and optimized just like in Eigen. There are no temporaries in that example.

Even if you used MArrays, which are small statically sized mutable arrays, the intermediate temporary arrays would live on the stack and thus not cause any allocations.

I suppose I should caveat my grandparent comment by saying that StaticArrays and MArrays are only good up to 14 x 14, if memory serves me correctly. That rule of thumb might be out of date; the compiler might construct large tuples more efficiently these days.

I believe Eigen has similar problems, but I'm not sure.

However, there is work happening on large stack-allocated arrays in Julia. StrideArrays.jl can create stack-allocated arrays of any size that will physically fit on the stack.


This package is work in progress though and has some rough edges.

That's exciting news :)

In this code, the main problem, I think, is that there are intermediate results being allocated, e.g. Fk_1 * Pu * Fk_1'. I'd speculate that you could improve on the baseline code by preallocating these in the same way as Pp, K, aux1, and Pu are initialized outside of the loop.
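The speculation above is easy to check with `@allocated` (a sketch of my own, with random 18×18 matrices standing in for the example's state):

```julia
using LinearAlgebra

A, B = randn(18, 18), randn(18, 18)
C = similar(A)              # preallocated output buffer

C .= A * B                  # still allocates: A * B creates a temporary
mul!(C, A, B)               # writes the product into C with no temporary

mul!(C, A, B)               # warm-up done; measure the steady state:
alloc = @allocated mul!(C, A, B)   # 0 bytes
```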

Are you sure the difference is due to the allocations? I would expect this to be dominated by the matrix multiplies or SVDs. Are you comparing with the same BLAS/LAPACK?

Edit: OK, I see those are small matrices. Then StaticArrays should be a nice contender here, both for speed and readability.

sign in required

or you can try outline.com


Use Incognito Mode/Private Browsing if Medium is telling you that you ran out of free member-only stories.

You can also just delete your cookies for medium.com

Not for me?
