If you really care about extracting every possible ounce of performance out of Python's scientific stack, which relies extensively on Numpy, the best practical guide I have found for doing that is "From Python to Numpy," by Nicolas P. Rougier of Inria: https://www.labri.fr/perso/nrougier/from-python-to-numpy/
Here's a typical example of the kinds of optimizations this guide teaches you, in this case by avoiding the creation of temporary copies of Numpy arrays in memory:
# Create two int arrays, each filled with with one billion 1's.
X = np.ones(1000000000, dtype=np.int)
Y = np.ones(1000000000, dtype=np.int)
# Add 2 * Y to X, element by element:
# Slowest
%time X = X + 2.0 * Y
100 loops, best of 3: 3.61 ms per loop
# A bit faster
%time X = X + 2 * Y
100 loops, best of 3: 3.47 ms per loop
# Much faster
%time X += 2 * Y
100 loops, best of 3: 2.79 ms per loop
# Fastest
%time np.add(X, Y, out=X); np.add(X, Y, out=X)
100 loops, best of 3: 1.57 ms per loop
That's a 2.3x speed improvement (from 3.61 ms to 1.57 ms) on a simple vector operation (your mileage will vary!).[1] This only scratches the surface. The guide goes into quite a bit of explicit detail about how Numpy arrays are constructed and stored in memory and always explains the underlying reasons why some operations are faster than others. In addition, the guide has a section titled "Beyond Numpy" that points to even more ways of improving performance, e.g., by using Cython, Numba, PyCUDA, and a range of other tools.
import numba
@numba.vectorize(nopython=True)
def add2(x, y):
return x + 2 * y
add2(X, Y, out=X)
This is 40% faster on my machine. It only needs to read X and Y once, and write X once. It does take a bit of extra time on the first run, because it uses JIT compilation (LLVM).
Numba lets you implement vectorized operations in terms of elements. In the above, x and y are scalars. You get support for various NumPy "ufunc" features for free, such as the out parameter used here.
X = ones(Int, 1_000_000_000)
Y = ones(Int, 1_000_000_000)
# explicit dot notation
X .+= 2 .* Y
# or with a macro
@. X += 2 * Y
The + and * do not allocate, the operations are fused into element-wise operations so the vectors are traversed only once, and the loop is compiled to vectorized code. No need for Numpy / BLAS.
That's impressive! What does fusion mean? I've heard fusion multiple times and I've the feel that fusion (in haskell at least) is something very advanced that's not really thought. Can you explain?
Two passes with one operation each are fused into one pass that has two operations?
Too bad common languages today (Python and its extensions for example) are a big step back from Matlab when it comes to matrix and vector operations syntax.
Not only now you are employing a foreign data type (numpy array) that requires separate learning independent (potentially confusing) to your existing Python knowledge, you also need mentally differentiate:
X = X + 2.0 * Y
X = X + 2 * Y
X += 2 * Y
X += Y; X += Y
, all irrelevant to your semantic logic.
Why not realizing that it is much more straight forward to directly do that in C? It is only a bit more to type but a much simpler mental load to understand (and maintain in the long run). If performance is at stake, spell it out in C (with x86 intrinsics if necessary) and put them into the semantics of your code.
Concerned with performance (in ms order) in Python is ill.
There's no need to mentally differentiate between all those approaches when writing code with Python's scientific stack.
In practice, code is initially written with relatively little regard for how it affects the performance of Numpy and other libraries like it, and then, afterwards, and only if necessary, these techniques are used to optimize those lines of code that prove critical to performance, typically only a small fraction of all lines.
However, if you're working on a project for which every line of code is performance-critical, then I would agree with you, Python would not be a good choice for that.
numpy calls BLAS libraries, which are heavily optimized and parallelized. Note that it is also very dependent on which particular BLAS library is used, as several are older and not nearly as optimized as newer ones (the difference can be rather dramatic). If you're curious, the best one to use is OpenBLAS.
BLAS really shines when you do matrix multiplication, for element-wise operations the best you can do is to add numbers using SIMD instructions or put the load to GPU, and most numeric libraries already when possible. The benchmark about seems unrealistic, here are results from my newest MaBook Pro:
In [2]: import numpy as np
In [3]: X = np.ones(1000000000, dtype=np.int)
In [4]: Y = np.ones(1000000000, dtype=np.int)
In [5]: %time X = X + 2.0 * Y
CPU times: user 10.4 s, sys: 27.1 s, total: 37.5 s
Wall time: 46 s
In [6]: %time X = X + 2 * Y
CPU times: user 8.66 s, sys: 26 s, total: 34.7 s
Wall time: 42.6 s
In [7]: %time X += 2 * Y
CPU times: user 8.58 s, sys: 23.2 s, total: 31.8 s
Wall time: 37.7 s
In [8]: %time np.add(X, Y, out=X); np.add(X, Y, out=X)
CPU times: user 11.3 s, sys: 25.6 s, total: 36.9 s
Wall time: 42.6 s
No surprise, Julia makes nearly the same result:
julia> X = ones(Int, 1000000000);
julia> Y = ones(Int, 1000000000);
julia> @btime X .= X .+ 2Y
34.814 s (6 allocations: 7.45 GiB)
UPD. Just noticed 7.45Gib allocations. We can get rid of it as:
julia> @btime X .= X .+ 2 .* Y
20.464 s (4 allocations: 96 bytes
or:
julia> @btime X .+= 2 .* Y
20.098 s (4 allocations: 96 bytes)
I could have not noticed use of swap in the previous test, so I repeated it on a Linux box and 1e8 numbers (instead of 1e9). Julia took 100.583ms while Python 207ms (probably due to double reading of the array). So I guess adding 1e9 numbers should take about 1 second on a modern desktop CPU.
I think the benchmark was probably done on a supercomputer. But that's really interesting how well Julia did. I did a basic logistic regression ML implementation in it years ago and I was impressed, but I stopped following its progress. Might have to keep it on my radar!
I still don't see how it's possible, no matter how optimized it is. Assuming 8-byte ints (which is what np.int seems to be on 64-bit) you're looking at reading at least 16GB of data since you're operating on two 8GB arrays and you have to read the data in each one at least once. If you can do that in a millisecond, that's a memory bandwidth of about 16TB/s. I thought modern CPUs had memory bandwidth of tens of GB/s, maybe low hundreds for really high-end stuff, and some brief searching seems to confirm that. What am I missing?
Edit: testing the given code on my 2013 Mac Pro, the fastest one at the end completes in one second or so (just eyeballing it), which makes a lot more sense.
The example the OP gave was from a tutorial/website that is hosted at the Laboratoire Bordelais de Recherche en Informatique. I imagine they probably have some heavy duty machines to crunch numbers on.
Not only that, but going through the array twice is apparently faster than doing it once and multiplying by 2. Is multiplying more expensive than fetching/storing from memory? This is counter-intuitive. I must be missing something.
You're going through the array twice whatever you do (once when multiplying by two then a second time when adding the arrays together), Python/NumPy isn't clever enough to figure out that it can be done in a single loop.
In [1]: %%timeit x = np.ones(100000000); y = np.ones(100000000)
...: np.add(x, y, out=x)
...: np.add(x, y, out=x)
...:
1 loops, best of 3: 287 ms per loop
In [2]: %%timeit x = np.ones(100000000); y = np.ones(100000000)
...: x += y
...: x += y
...:
1 loops, best of 3: 287 ms per loop
In [3]: %%timeit x = np.ones(100000000); y = np.ones(100000000)
...: np.add(x, y, out=x)
...: np.add(x, y, out=x)
...:
1 loops, best of 3: 286 ms per loop
In [4]: %%timeit x = np.ones(100000000); y = np.ones(100000000)
...: x += y
...: x += y
...:
1 loops, best of 3: 280 ms per loop
From looking at most of these examples it's clear that Python will be able to perform really well with projects like numba, Cython and PyPy in the picture. My impressions after using Cython are that Cython can for the most part even outperform Julia. Couple that with the fact that with Julia there will probably be a few steps before you squeeze the maximum performance possible of out your algorithm, and it makes Cython a no brainier for a developer that is already using Python.
That is until you consider that most of these examples do not showcase (what is in my humble opinion) Julia's true strength - homoiconicity and an opt in extensible type system that is part of the language.
It's trivial to make a new Class in Python and set up behavior that one would be interested in. One could easily implement decorators, metaclasses, descriptors, properties and other dunder methods to customize class behavior to the heart's desire. But Cython (as far as I understand) does not support these Python features. If an existing library implements an interface using these Python features, it's not obvious to me how it would be possible to use Cython to improve performance. Julia on the other hand has no such restrictions. I would say it is desired to use macros and metaprogramming in everyday programming. I've been a Python user for over 6 years now. I've only looked at Julia for a few months now and I already can see concepts and ideas in Julia that have no analogue in Python. But most if not all Python programming features translate to Julia one way or the other.
I love Python, but I can't help but feel that the lack of a opt in type system and more powerful metaprogramming constructs is hurting it in these comparisons.
Isn't adding things like memoization or writing the code in C (Cython) missing the whole point of the benchmark, which is to test the overhead of recursion / function calls in the language itself?
Users of programming languages interested in the end result (output of the program etc) do not care. They want the fastest performing language for the job. Julia website has been misleading people. Due to those claims, I spent a week porting some of my simulation code to Julia before I realized that it is actually slower in (optimized) Julia than in optimized Python.
For those who only care about the output of fib(20) there are more efficient methods than any of the Julia or Python implementations posted in the link, e.g. lookup tables.
The assumption in benchmarks is of course that the results carry over to other use cases. Here, what is being tested is the overhead in recursion, nothing else. The fact that it happens to be Fibonacci-numbers that is being computed is irrelevant.
Isn't that basically what he does when he adds caching? I know it's not a static lookup table but you could prep it by invoking it with a sufficiently large n.
You are not restricted to recursion, loops, functions etc, but these language constructs might be useful in situations, and therefore the performance (overhead) of these constructs is interesting.
I highly doubt that was optimized Julia code. Sure you might have used a really fast python library and a very slow Julia library, but there is no way optimized Julia code should be slower than Python. Julia simply offers far more ways to control performance than python.
And thus we want benchmarks that measure language performance, not the fastest way to compute Fibonacci numbers. The solution to the latter problem is the same in Python and Julia and consists of calling the assembly function in gmp...
Julia has a lot more potential for optimizations than python, but what python has going for it is the larger ecosystem. So if you want to write a one-off experiment that's similar to stuff that already exists in C bindings to python you should use that. If you plan to write a large application that you still want to optimize for current processors in 10 years then I'm not sure if python is a good choice.
I agree... Judged against the title, the article adds little value: take one micro-benchmark; implement the naive Python algorithm using non-CPython approaches like Cython, Numpy and Numba; and stick on clickbait title that implies a speedup that applies in all cases.
The article would be much better if it ditched the comparison to Julia and instead showcased "Some ways to make Python code faster."
Isn't the benchmark missing the whole point of what people actually want to do, which is to run their calculations fast (without caring whether they are written using the constructs the benchmark tests or not)?
People might want to use recursion. They might want to split out their code into small functions without having to think about the overhead of function calls. They might want to just write a for loop instead of transforming it to vectorized notation.
The benchmark investigated in the link answers the question "if I use recursion, and the function body is small, how much will I be penalized?". Changing the benchmark so that it no longer answers that question makes it pointless.
Python is a scripting language, so its strong point is being used as a glue, and half of its standard library is implemented in C anyway.
Plus, the "C-implementations" he mentions are available as readily usable modules (numpy), or semi-transparent jit/aot compilers only needing a few annotations (Cython, numba), not actual C you have to write.
Besides, isn't the whole point: finish your project fast with the language you know using whatever it makes available to easily speed your code up?
As opposed to: "be a purist and not use wrapped libs written in another language".
Who cares for that? Even if it comes up, it's to avoid the hassle of having to deal with an additional language, setup etc -- which for numpy, Cython etc is almost none-existent (as you don't need to actually deal with C).
And of course, despite the purity of Julia's "single language", the hassle of moving to a totally different language, which few use, is not yet stable in syntax, compiler etc, and has fewer libs, should also be considered...
The specific purpose of the benchmark, though, is to compare implementations of the same algorithm natively in the language itself, as explained explicitly on the Julia website just under the table of benchmark results (see quote below).
As such, I do think the article misses the point somewhat. Of course, if there's a numpy function that does what you want, you'd use it in real life. But what if there isn't? The nice thing about Julia is that the function can be written in Julia itself, and fast.
> It is important to note that these benchmark implementations are not written for absolute maximal performance (the fastest code to compute fib(20) is the constant literal 6765). Rather, all of the benchmarks are written to test the performance of specific algorithms implemented in each language. In particular, all languages use the same algorithm: the Fibonacci benchmarks are all recursive while the pi summation benchmarks are all iterative; the “algorithm” for random matrix multiplication is to call the most obvious built-in/standard random-number and matmul routines (or to directly call BLAS if the language does not provide a high-level matmul), except where a matmul/BLAS call is not possible (such as in JavaScript). The point of these benchmarks is to compare the performance of specific algorithms across language implementations, not to compare the fastest means of computing a result, which in most high-level languages relies on calling C code.
> Of course, if there's a numpy function that does what you want, you'd use it in real life. But what if there isn't?
I have been in this exact situation, a numerical algorithm that was missing from Numpy but the rest of the project is in Python.
The solution is:
1. Write a Python function that operates on numpy arrays,
2. Add a few Cython type declarations to loop variables,
3. Mark the source file as "compile with Cython at runtime", which seamlessly turns the Python function into a C library.
The end result was a 1000x speedup compared to pure Python, very close to numpy built-in functions working on similarly sized arrays. And it needed only about 5 lines of setup code and type declarations for a few variables - all the code could still be Python and use all of Python even in the compiled files.
>The specific purpose of the benchmark, though, is to compare implementations of the same algorithm natively in the language itself, as explained explicitly on the Julia website just under the table of benchmark results (see quote below).
But then they go and write their own sort instead of using the language provided ones when offered. All these show is that julia is apparently faster than incredibly unidiomatic python written by someone who clearly doesn't write python. Okay. That's neat.
Numpy is such an essential library for any type of scientific computing in Python that ignoring it would be missing the point, if anything. The library infrastructure is part of the appeal of a programming language and Numpy is the default for anything compute-heavy in Python.
>shifting all the runtime heavy computation to C-implementations... this article is missing the whole point
The author understands your perspective but he's deliberately using a different one. The idea is that a data scientist user would realistically use NumPy/SciPy optimized C libraries instead of writing raw loops in "pure Python" to walk pure Python lists that model matrices. Therefore comparing pure Python code (interpreted by the canonical CPython interpreter) to Julia is the opposite of his goal.
The article's title is: "How To Make Python Run As Fast As Julia"
The author wanted to write about: "How To Make Python _Projects_ Run As Fast As Julia"
But many readers insist that the article should have been: "How To Make Pure Python Code Run As Fast As Julia"
(The 2nd type of article is also interesting, but the author didn't write it and didn't claim to.)
The article's comment permalink doesn't seem to jump to his exact comment so I'll copypaste the text here:
>There is indeed a disagreement about the purposes of the benchmarks. I see at least two purposes at stake here.
>1. A user point of view, which is to see how t best accomplish things in a given language. It is the result of various tradeoffs, including this: balance the time and effort to code something with the efficiency you get. That's the view of most Python users reacting to my post. We don't mind using Python libraries, even if they aren't written in 'pure' Python. Actually, the massive set of existing Python libraries is probably one key reason for its success.
>2. A language implementer point of view, which focuses on how elementary language operations perform. That's the purpose of Julia micro benchmarks I think.
>If people do not agree on the yardstick they use, then the discussion is not going to be fruitful. This disagreement explains most of the comments I saw until now.
> The author understands your perspective but he's deliberately using a different one.
In this case the way the author shows it isn't the best one: he modifies Python code to be more realistic - that's ok, but doesn't he do the same thing for Julia? Obviously, writing a recursive fibonacci functions isn't the best way to implement it. Obviously, using caching can improve performance. But why not to apply these changes to both implementations?
>he modifies Python code to be more realistic - that's ok, but doesn't he do the same thing for Julia?
Yes, I agree he didn't rewrite the Julia fibonacci examples the same way as Python.
My comment was speaking more to the usage of "optimized C libraries" in his benchmarks as being appropriate for his particular goal. (As response to poster hojijoji's objection to C-implementations.)
I used "goal" to mean his "overall goal" of showing optimized C libraries instead of pure Python for various scenarios. (My response to poster hojijoji objection to C libraries.) Using C libs is not an invalid benchmark if one understands why the author used them.
Yes, when the author didn't change both Python AND Julia fibonacci examples in exactly the same 1-for-1 manner, it does detract from his overall message because it invites nitpicking. (The nitpicking is reasonable if you're hyperfocused on that fibonacci example.)
Based on your other responses in this thread, you seem to want him to write Python-vs-Julia benchmarks that's suitable for benchmarksgame[1]. You have a valid perspective but that's not the article he claimed to write.
My question is then, why bring up Julia at all? Of course, there will be nitpicking when you put two languages against each other, in a benchmark written for a specific purpose, and then start to modify the implementation for one of the languages. It seems like the goal of the blog post would just as well be achieved by saying "here are some ways of speeding up a function in Python".
Because he wasn't writing about Python in a vacuum. In his very first paragraph[1], one can see that the article was a response to Julia's benchmark.
Your question could be reversed for the authors of julialang.org website and they could've restricted themselves to say "here are some ways of writing functions in Julia" -- without bringing up Python at all.
But the Julia folks didn't do that because ... people like to write comparisons to other things!
[1] see 1st paragraph that begins and ends with: "Should we ditch Python and other languages in favor of Julia for technical computing? [...] did the Julia team wrote Python benchmarks the best way for Python?"
There are a few things I consider when I try to improve performance of my code.
1) I really like the advice of "make it run, make it right, make it fast" [0]. First I build it, then I write a thorough test harness (or make multiple versions run at the same time and check their results at runtime), then I rewrite it.
2) I don't really like optimising to the core, like it's mentioned in the article, things like Cython and numba often add dependencies that don't port well and they may reduce legibility and maintainability (that's more about Cython than numba). What I find most useful, and this should be obvious, that algorithm/data structure changes often yield the biggest benefits.
3) The old rule that if you speed up code that only runs 1% of the time, you're not gaining that much, I try to keep that in mind. (There are exceptions, of course.)
4) Performance is not just runtime, it starts with coding it up in the first place. So start to finish, you're not always better off with a fast language. If it's a one-off, it can be more performant to use a slow language, if it's a tool that will run untouched, you might want to spend some time tinkering with it. Etc.
All of this seems obvious in retrospect, but it took me a while to appreciate these principles.
A side historical note which doesn't affect anything; http://wiki.c2.com/?MakeItWorkMakeItRightMakeItFast traces that expression back to Stephen C. Johnson and Brian W. Kernighan's "The C Language and Models for Systems Programming" in Byte magazine (August 1983) - "the strategy is definitely: first make it work, then make it right, and, finally, make it fast."
The make it work/make it right principle is also referred to as "tracer bullets" in certain circles: if you get the bare bones full process working it's much easier to flesh it out than doing it one "fully specced" component at a time.
I had a lot of experience a few years ago with Julia, having ported a lot of my numerical code in the language. I ended up severely disappointed with it for moderately complex numerical computing projects.
The language was really nice; the problem was terrible performance, and I believe that this was caused by bad design of the memory semantics. Even in optimised code, there were temporary objects and copies all over the place, and they were very hard to eliminate without resorting to using global arrays everywhere.
I had exactly the opposite experience. If you were having problems with temporaries, you were probably using Matlab style vectorized operations. Back when you tried Julia, explicit loops could avoid this allocation issue. Now, Julia has syntactic broadcast operators which will fuse loops for you. See https://julialang.org/blog/2017/01/moredots
> Even in optimised code, there were temporary objects and copies all over the place, and they were very hard to eliminate without resorting to using global arrays everywhere.
Could you elaborate a bit with some examples, because this goes against everything I understand about the strengths of the language: (without having used it beyond a few first examples myself)
- you can define types and operations on them down to the bits themselves, giving you a lot of control over the memory (and definitely a lot more than Python)
- it has a lot of support for in-place mutation out of the box
- globals are almost always bad news for performance
Sure, adding two arrays with
C = A + B
.. produces an intermediate array, yes, but if you know A and B are not reused, wouldn't using:
Even better would be to write (see dot vectorization [1]):
C .= A .+ B
Benchmarks for 3 matrices of size 1000x1000:
julia> using BenchmarkTools
julia> @benchmark C = A + B
BenchmarkTools.Trial:
memory estimate: 7.63 MiB
allocs estimate: 2
--------------
minimum time: 2.359 ms (0.00% GC)
median time: 2.713 ms (0.00% GC)
mean time: 3.794 ms (28.81% GC)
maximum time: 62.708 ms (95.27% GC)
--------------
samples: 1314
evals/sample: 1
julia> @benchmark C .= A .+ B
BenchmarkTools.Trial:
memory estimate: 128 bytes
allocs estimate: 4
--------------
minimum time: 1.232 ms (0.00% GC)
median time: 1.320 ms (0.00% GC)
mean time: 1.356 ms (0.00% GC)
maximum time: 2.572 ms (0.00% GC)
--------------
samples: 3651
evals/sample: 1
Note that memory usage dropped from 7.63MiB to 128 bytes.
A little bit late to the party here, but the number of allocations is really just 0 bytes. It shows 128 bytes because the benchmark is creating new references to A, B and C. To correct this use either interpolation with $A, $B and $C or initialize A, B and C in the setup phase:
> @benchmark C .= A .+ B setup = (A = rand(1000, 1000); B = rand(1000, 1000); C = rand(1000, 1000))
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 2.048 ms (0.00% GC)
I had exactly the same experience. Julia's JIT compiler is (or at least was) really picky to what code it did and didn't like. Two pieces of which look 'basically' the same can have a performance difference of an order of magnitude due to the JIT liking one version and not the other.
How do you know that the code was optimized for Julia? Your poor results suggests you are simply not familiar with how to optimize in Julia. E.g. did you check type stability with type_warn() ?
I guess, it will take loosing market share to Julia for Python devs actually start taking into consideration making PyPy the canonical implementation.
Rewriting libraries in C isn't making Python code run faster.
The latest Julia conf had lots of cool presentations, and even though 1.0 might come only at the end of the year (with luck), the uptake among the research community is quite good for such a young language.
Pypy is not that much faster than Python and is certainly still much slower than using a C library. Indeed, there's a limit to how fast python can be considering how many operations it needs to do for even simple statements (such as summing two integers[0]).
This is the cost of being so extremely dynamic, more so than Javascript and many other scripting languages.
I don't know that it "is", but it certainly can be. In Python there's just too many ways to screw with the meaning of "a.b" at runtime; it could be in the class dict, it could be a property, it could be modified via a couple of methods on the class, it could be screwed with a couple of other ways too IIRC. You have to write some sort of code that either proves these things can't happen, which gets really hard as the program scales up, or code that checks for them happening. Other languages that can be "dynamic" but still have only one or two things to be checked can be much more easily JIT'ed.
I believe much similar magic is possible in Python, but you get to implement it on top of the relatively simple interactions between instance dict, class dict, and metaclass. They would not be bullet-proof, and they are not built into the default implementation. But they are very possible.
"Everything is an object" is a higher-level abstraction (that is, maps worse to native operations) than "everything is a function" and "everything is a linked list." Also, I don't know much about Lisp, but I was under the impression that Lisp metaprogramming can be entirely or mostly resolved at compile-time, while the same is not true of Python.
In the context of the above comment, I'd say that they're about the same. Pypy appears to run between half-as-fast and twice-as-fast as java (hotspot) with a much smaller memory footprint, depending on the benchmark.
Either Numba has improved by an order of magnitude since I looked last (congratulations that's fantastic, its a difficult problem to solve), or you haven't met the edge cases yet
>Rewriting libraries in C isn't making Python code run faster.
No, but nobody cares for the purity of only-Python code. They care for running their calculation fast, and if that takes C-enabled modules they are totally fine with it. With things like numpy available, it's not like they have to write those calculations themselves.
Besides Numpy smokes PyPy for the same kind of calculations (not that one would care to rewrite the tons of stuff available in there in pure Python).
>No, but nobody cares for the purity of only-Python code.
Let me call BS on that. I would prefer not to be forced to drop down to C if I could have it. I am quite sure many feel that way.
I would definitely like the ease of calling an existing C or Fortran library that already does what I need. But that I need (almost to a fault) to context switch between languages is certainly not a high point especially when I am prototyping something new and for which there aren't any good libraries yet.
Yes it absolutely doesnt. I was talking about things when you have to reach beyond Numpy/Scipy/Pandas/Scikits. For C or Fortran libraries and tools like those that are already there, Python is very pleasant to use.
Do you have any source for this affirmation that people are moving from Python to Julia/Go/etc ?
Those language are definitely growing, but so is Python and while there is movement between the languages I am not sure is a massive unilateral migration as you seem to imply.
The whole point of programming language conferences is to focus on bleeding edge "what's new/potentially coming down the pike", instead of "what's being adopted massively".
Of course the type of people who want to dabble with different languages are going to move around. Most of the rest of the world wants to learn the one language that mostly gets them where they need to go, and be done with it.
There is a photo I recently saw of 1200 students starting in Berkeley's introductory data science class. They're doing Python. Are there even 1200 serious Chapel users in the world?! I mean, I love the ideas in Chapel (and even the early ZPL), and I drew inspiration from them. But the fate of a computer language is as much determined by the growth dynamics of its user community as by its differentiating features.
While Chapel's community is small today, it has real users and it's growing.
Chapel is a compelling language that offers a good combination of productivity and performance.
The Chapel project is more ambitious than Python, Julia, or Go in that it provides a unified programming model for parallelism and locality that enables scaling from laptops to clusters and supercomputers. Because we are aiming for a higher target, it has taken longer to move beyond the prototype stage than it would for a serial language.
There is no correlation whatsoever with the amount of talks in conferences and what's happening in real life.
If conferences were a metric for anything you'd assume most of coding in the world is happening in Node.JS/React/JS-package-manager-of-the day.
The reality is that the vast, vast majority of devs and code that's actually running the world are doing some combination of Java, C#, PHP, C++, C and COBOL. Those people aren't exactly going to conferences; they have regular old boring 9 to 5 jobs and are plenty busy trying to make shit work or diving in piles of legacy code or installing run-of-the-mill CMSes.
The empirical information from the increase in the amount of talks we get to see at the respective LanguageConf, and who is doing them, seems to state otherwise.
Neither does C. But that, and all the other languages you listed, are not Python. They have differing implementations, constraints and resources that make the comparison unfair and IMO invalid.
> Just the community not caring about it and forcing everyone else to go down into C or somewhere else, it seems.
That, or the inherent dynamic nature of Python that makes it ridiculously hard to JIT (especially without breaking compatibility), and that a tiny subset of developers have enough experience (and will) to work on it as there is no Google willing to funnel buckets of money into making a V8 for Python, to the point where PyPy has reached an OK state after 10 years of development and a codebase that dwarfs CPython in size and complexity.
I hate people who deride the community for not caring enough about adding a JIT. Yeah, that sure is the reason. Not mind-boggling technical limitations, nope, just pure laziness.
I gave up on Python about version 2.0, and went to languages that care about having a JIT/AOT compiler from the get go, instead of forcing me to write C.
A lot has changed in the 17 years since 2.0 was released. Performance is one of them.
C and Python have sometimes overlapping use-cases, but often not. I'm not sure if your arguments still hold water, especially given projects like Cython and PyPy that give AOT and JIT compilation to the language so you don't have to write C.
This is totally missing the point of the Julia benchmarks. They were meant to show that you can implement fast low level code in Julia. That is why the python code also had to be written with loops otherwise it would really just test the performance of a library and not the language.
The point of Julia is to not have to deal with both C/C++/fortran and python code when creating a high performance solution. You don't always have a high performance library to utilize for your particular problem.
Being able to do fast things using typical procedural idioms, and being able to implement new, fast routines without dropping to a different, lower-level language are huge wins.
Python is a nice language syntactically, it's just too bad that fundamental design decisions made it intrinsically slow.
It's easy to optimize small kernels. Less straightforward is optimizing an entire library in Python, especially if pieces need to be composed. For this, the most interesting library I've come across is loopy which handles things like fusing kernels, unrolling loops _before_ code generation.
You don't have to as Julia can still get high performance. And BLAS is actually an example of Julia strenghts. The ability to pick exact code specializations help effectively pick the most suitable BLAS function. Python can't pick correct function based on type and number of arguments.
As for gluing. That is often needed not for performance reasons but because you don't want to reimplement lots of proven code which is well devugged and tested. One of the reasons to pick BLAS.
I use Julia a lot for shell like scripting. That means running a lot of unix commands I don't know how was implemented like the ios code signing tools.
Doing this in Julia is far superior to doing it in python. I was actually shocked how clunky it was in python when I attempted a rewrite.
>Python can't pick correct function based on type and number of arguments.
What? Numpy does this.
>Doing this in Julia is far superior to doing it in python.
Then, quite simply, you're doing it wrong in python. Show me an example of this system code that's nice in julia and clunky in python and I'll show you a superior implementation in python.
And if you want to timeout or parse logs or whatever, you go from there and it gets more and more complicated. If Julia has solved this then the community should be advertising this above all the alleged performance stuff.
$ time julia -e 'print(1+1, "\n")'
2
real 0m0.996s
user 0m0.664s
sys 0m0.367s
$ time python3 -c 'print(1+1)'
2
real 0m0.160s
user 0m0.080s
sys 0m0.033s
Almost a second just to print a number in Julia. This makes it unattractive for shell scripting. Startup performance is an open issue in Julia and smart people are working on it (I hope); but it's not there yet.
"The only reason to pick python is momentum, libraries and mindshare."
I'm confused. Are you using this as an argument for Julia?
I don't post "LOL" very often on HN, but congrats... LOL!
Unless you code in a vacuum, solving the most basic of problems that don't require integration with other services, formats, and APIs, and can figure everything out yourself, then momentum, libraries, and mindshare are EXACTLY the things you should be optimizing for.
(FWIW I think Julia is a nice language and has an amazing dev community.)
Here's a typical example of the kinds of optimizations this guide teaches you, in this case by avoiding the creation of temporary copies of Numpy arrays in memory:
That's a 2.3x speed improvement (from 3.61 ms to 1.57 ms) on a simple vector operation (your mileage will vary!).[1] This only scratches the surface. The guide goes into quite a bit of explicit detail about how Numpy arrays are constructed and stored in memory and always explains the underlying reasons why some operations are faster than others. In addition, the guide has a section titled "Beyond Numpy" that points to even more ways of improving performance, e.g., by using Cython, Numba, PyCUDA, and a range of other tools.I highly recommend reading the whole thing!
--
[1] Example copied from here: https://www.labri.fr/perso/nrougier/from-python-to-numpy/#an...