
Python’s Weak Performance Matters - georgecmu
https://metarabbit.wordpress.com/2018/02/05/pythons-weak-performance-matters/
======
passive
This is a weird article at this point in time.

The question it addresses:

"Does Python's performance matter?"

Has always had the answer:

"Sometimes, and you have options for those cases."

The OP found a "sometimes", and he's using one of those options. In this case,
he's got Python for prototyping and glue, with Haskell improving performance.
This is as it should be.

I don't know of any Python advocates who say it's the right tool for every
part of every job. What we will say is it's usually a good "first" tool for
every job. Building a system out in Python allows you to get something
representative fairly quickly, which helps identify if there are areas where
Python alone is not enough.

~~~
gameswithgo
I would argue that performance always matters, and that Python is never the
right tool for the job in an absolute sense.

Python may be the right tool for the job _given the options we have today_ but
there is no reason we cannot have a language exactly as nice to use as Python
is, but that also provides good performance. Languages like Nim or F#
approximate that ideal, for instance. And while I realize there are
high-performance variants of Python, those should be the standard and only
path. The slow path shouldn't exist.

It is a failure of our community that we allow languages to proliferate while
remaining slow. This is bad because allowing slow tools to become popular
means people create slow things, which wastes other people's time and energy.

Electron becoming a standard way to make cross-platform desktop apps is
another example. Someone is not wrong to choose Electron for that job, but we,
the software community, are wrong to have let something that inefficient
become the easiest way to do that job.

You cannot simply dismiss this issue by saying "Well don't use the slow
software you don't like then", as many of these things become de-facto
standards that you cannot avoid. Your place of work may require Microsoft
Teams as the chat software, and now you are using a huge % of your laptop's
RAM and battery for simple text transmission. Atom becomes the popular target
for language plugins and ends up the only usable way to get good IDE features
for your language, and you suffer the performance hit for it.

We can do better!

~~~
lmm
> It is a failure of our community that we allow languages to proliferate
> while remaining slow. This is bad because allowing slow tools to become
> popular means people create slow things, which wastes other people's time
> and energy.

Make it work, then make it work right, then make it work fast. I mean yes, a
lot of things are slower than they should be, but the level of outright
_correctness_ bugs in software today is mindblowing. So while replacing our
tools with faster tools should be something we do in the long term, I'd put a
higher priority on lowering defect rates and making it easier to produce
working software.

~~~
blub
This "make it work" adage makes no sense at all when scrutinized.

Does anyone believe that other industries think like that? First let's invent
a washing machine that washes, but occasionally sets clothes on fire and takes
24h for a washing cycle. Then we'll redesign it so it doesn't set things on
fire, and finally redesign it yet again to finish in 2h.

It's incredibly wasteful. For any complex project, making it work and fast
only at the end will either result in massive cost overruns or an outright
canceled project.

~~~
brians
Yeah. Speed is the dominant cost in software, but other industries treat other
costs similarly. To pick another darling, look at Tesla: the Roadster is
expensive, flammable, suitable only for enthusiasts. The S is generally
useful, but too expensive; still a narrow market. The E is their first
general-purpose product.

That’s normal. Same deal with Apple II, Mac, iPhone. Same deal with ether,
coarse general anesthetic, modern mixes.

Yes, this is completely normal, and software based products should expect to
evolve similarly.

~~~
blub
All of the things you mentioned worked and had good enough performance. There
was no crappy version of the iPhone which ran out of battery in 1h and took
seconds to refresh when scrolling.

So obviously they thought about the performance aspect from the start.

~~~
dagw
_There was no crappy version of the iPhone which ran out of battery in 1h and
took seconds to refresh when scrolling._

I'm pretty sure there probably was. They were just smart enough to make sure
the only people who saw that version were a handful of engineers in their
R&D lab.

------
skywhopper
The quoted argument that "easy to write but slow languages are better because
programmer time is far more costly than CPU speed" was pretty common, and I
honestly think correct, 10-15 years ago. But things have changed.

CPU performance long ago hit physical limits, and more and more we are scaling
out applications across hundreds, thousands, or millions of servers. We've
passed the inflection point where CPU speed really is more expensive than
programmer time, if you are running that code at a big enough scale.

Add in containerization and cloud VM platforms where the tradeoffs of space
and performance versus money start to become very clear. Add in better, safer
languages like Rust and Go for writing high-performance code. And today if you
can spend three times the programming time writing in a faster and more
efficient language, and it runs 100x as fast in a tenth the memory footprint,
you are talking about massive overall cost savings.

~~~
theli0nheart
> _CPU performance long ago hit physical limits, and more and more we are
> scaling out applications across hundreds, thousands, or millions of servers.
> We've passed the inflection point where CPU speed really is more expensive
> than programmer time, if you are running that code at a big enough scale._

When you're starting a startup, scaling out your application to hundreds,
thousands, or millions of servers isn't something you're going to do right off
the bat, and more likely, that's never going to happen, no matter _how_
successful your company is. The number of companies operating at that sort of
scale can be counted on one hand. Most startups can run on a few boxes.

For those situations (which probably cover most new projects outside of big
companies), programmer time is indeed still the most important and costly
input. If you're a startup and your 100x more performant Go code takes an
extra 3-6 months to write, and a competitor beats you to market, no one is
going to care how much faster your runtime performance is, especially if
you're a web app, where CPU time should probably be last on the list of items
that could lead to slow application performance for end users.

A 100x difference in CPU time is nothing compared to the 1000x loss from a
cache miss or a 10000x disk read. I'd love to see an example where the CPU
difference outweighs any influence from disk or memory.

~~~
_dps
I was with you up until the last paragraph. Taken literally you seem to
suggest there is no such thing as a CPU-bound workload. That's obviously not
the case (cryptography is just one such example), but I would agree that many
people think they are CPU-bound when they are really constrained by something
else.

Secondly, Python and the patterns its expressiveness encourages are terrible
for cache performance. In a simple C program it's easy to do something non-
trivial in the space provided by L1 cache — in Python it's quite difficult
even to reason about what's going to be in L1 if you're using any of the fancy
features.

~~~
smitherfield
_> in Python it's quite difficult even to reason about what's going to be in
L1_

The interpreter's stack?

~~~
_dps
I haven't looked at it in a while, so I could be wrong, but I think with small
enough programs you can still squeeze some payload into L1 in long tight loops
where you're not jumping up and down the Python stack a lot.

But your overall point stands: if you're writing non-trivial Python programs
your L1 is usually spent on language/runtime overhead.

~~~
kevin_thibedeau
When such things matter you drop to Cython and avoid interacting with
PyObjects. Then you get native performance for tight loops.
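
For instance, a minimal sketch of what that can look like (a toy function, not
from any real codebase): the loop below only touches a typed memoryview and C
doubles, so Cython compiles it with no PyObject traffic inside the loop.

    
        # fast_sum.pyx -- compile with: cythonize -i fast_sum.pyx
        def sum_positive(double[:] xs):
            cdef double total = 0.0
            cdef Py_ssize_t i
            for i in range(xs.shape[0]):
                if xs[i] > 0:
                    total += xs[i]
            return total
    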

------
ptype
I love Python. The language is a joy, the ecosystem is fantastic. But yes,
let's be honest: if you cannot vectorise your code, it is slow, and I think
that will be its downfall eventually.

I'm excited about Julia, and I hope it gains popularity and the ecosystem
grows. Until then, and in particular until the data frames story can compete
with pandas, it's Python with Cython for me, but I'd rather skip the Cython if
it weren't necessary for performance.

Any early adopters running Julia in production with stories to share?

~~~
tekkk
I feel a bit ambivalent about Python: it's a nice language for prototyping and
quickly hacking things together. Yet I'm always baffled when I read Numpy's or
Matplotlib's documentation and try to make sense of it, as it can be (or at
least feel) so complex and highly ambiguous. E.g. sometimes Numpy's
documentation pages have no examples, or only very brief ones, of how a method
works, and most results from Google are only about advanced uses, not about
the basics of the method itself. In Matplotlib I still don't understand what
the _right_ way of initializing a pyplot is; there seem to be a million ways
to do it and a million parameters you can give. API changes and
inconsistencies pain me at times too (Pandas comes to mind). While these are
not faults of Python as a language, I think they greatly contribute to the
experience of using Python.

Also, I don't feel like the culture of Python programming focuses much on
documenting things, which makes reading code at times like transcribing
ancient Latin manuscripts. Maybe a good analogy would be JS back in the day
with global jQuery scripts. Too unrestricted and free-form, maybe. I'd wish
Python became more like Kotlin, with very clear patterns and great IDE support
(in addition to PyCharm).

Well, those are at least my experiences; feel free to disagree with me.

~~~
narimiran
> _In Matplotlib I still don 't understand what is the right way of
> initializing a pyplot, there seems to be a million ways to do it and a
> million parameters you can give. API changes and inconsistencies too pain me
> at times_

Matplotlib has the worst API of all Python libraries I have used over the
years!

If there were a fork of it that got rid of the Matlab way of doing things
(keeping only the OOP style) and had consistent names (no more `twowords` vs
`two_words`), I would gladly switch in a heartbeat.

------
dahart
> It is true that programmer time is more valuable than computer time, but
> waiting for results to finish computing is also a waste of my time (I
> suppose I could do something else in the meanwhile, but context switches are
> such a killer of my performance that I often just wait).

In film CG production, we had a rule of thumb. If running an interactive
program takes longer than about ten seconds, the artist (user) becomes more
likely than not to get up and go get some coffee or talk to someone else. We
consciously made an effort to keep anything someone needed to wait for to
under ten seconds, and save anything longer than that for nightly farm
renders. We were writing the code in C, btw.

~~~
cbcoutinho
Somewhat tangential to this, but I remember reading an article on HN about a
group creating some web app that was constrained by some size limit (50 KB or
so). Groups ended up putting 'dead code' in their projects to guard against
other groups taking their space while they were working on their feature. I
think this made it so the app never got less than 50...

Did you ever see something similar where devs put in some code as filler to
make room for future features they were working on?

~~~
dahart
Oh yeah, this is very common in game development, and I'm pretty sure in other
embedded dev too. There's a famous story about a game programmer who saved an
entire production, when it started crashing something like two weeks before
shipping, by commenting out one line of code. It turned out the line of code
was a malloc (of about a megabyte) that he'd added a year earlier, in
anticipation of the game running out of memory. My details are probably wrong,
but I'm pretty sure I've seen this story linked on HN.

I was in game dev for a decade, and I saw this happen where I worked: the
studio technical director adopted the practice of saving some space, because
we _always_ started running out of memory near the deadline as artists threw
in all their content.

*edit: [http://www.dodgycoder.net/2012/02/coding-tricks-of-game-deve...](http://www.dodgycoder.net/2012/02/coding-tricks-of-game-developers.html)

------
narimiran
If you would like 10-100x faster performance than Python, but would like to
keep the easy-to-read code, give Nim [0] a try.

I do all my work in Python, and I've been using Nim in the last couple of
months; it only took me a week or two to become productive in Nim.

Don't expect Python's large ecosystem, nor some Python goodies, but if you're
looking for a readable, writable, high-performance post-Python language - Nim
is the way to go!

[0] [https://nim-lang.org/](https://nim-lang.org/)

~~~
lmm
Why Nim rather than e.g. Haskell (mentioned in the article) or OCaml, which
are much more mature and have much bigger, more established tool/library
ecosystems?

~~~
narimiran
> _Why Nim rather than e.g. Haskell?_

Because Nim syntax will be familiar to a Python developer. Sometimes all you
need to do is add variable declarations and rename `def` to `proc`.

Haskell has a much steeper learning curve. Been there, struggled with that. If
I were to recommend a functional language to a Python developer, it would be
F#.

------
matt_wulfeck
> _The result is that I find myself doing more and more things in Haskell,
> which lets me write high-level code with decent performance (still slower
> than what I get if I go all the way down to C++, but with very good
> libraries)._

This strikes me as an odd conclusion to come to if speed was the main
motivator.

~~~
luispedrocoelho
OP here.

Speed is the main motivation, but total time is TimeToWriteCode +
TimeToRunCode.

Python has the lowest TimeToWriteCode, but very high TimeToRunCode. C++ has
the lowest TimeToRunCode, but high TimeToWriteCode. Haskell is often a good
compromise for me.

Also, with Haskell, it can be very easy to take advantage of 20 CPU cores,
while I don't have as much familiarity with high-level C++ threading
libraries.

~~~
aldanor
@ the OP - not to sound hostile, but you write code (like in the example here
[1]) that is bound to be slow, just from a glance at it: vstacking, munging
with pandas indices (and pandas in general), etc. For it to be fast, you want
pure numpy, with as few allocations as possible. I help my coworkers “make
things faster” with snippets like this all the time.

If you provide me with a self-contained code example (with the data required
to run it) that is “too slow”, I’d be willing to try and optimise it to
support my point above.

Also, have you tried Numba? It may be a matter of just applying a “@jit”
decorator and restructuring your code a bit, in which case it may get
magically boosted a few hundred times in speed.

[1]
[https://git.embl.de/costea/metaSNV/blob/master/metaSNV_post....](https://git.embl.de/costea/metaSNV/blob/master/metaSNV_post.py#L331)
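
(For the curious, a minimal sketch of the Numba route, on made-up data rather
than the OP's actual workload:)

    
        import numpy as np
        from numba import njit
    
        @njit  # compiled to machine code on first call
        def masked_sum(vals, mask):
            total = 0.0
            for i in range(vals.shape[0]):
                if mask[i]:
                    total += vals[i]
            return total
    
        vals = np.random.rand(10_000_000)
        print(masked_sum(vals, vals > 0.5))
    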

~~~
luispedrocoelho
That is the _FAST_ version of the code (people keep saying "of course, it's
slow", when it's the fast version).

Here is an earlier version (intermediate speed):
[https://git.embl.de/costea/metaSNV/commit/ff44942f5f4e7c4d0e...](https://git.embl.de/costea/metaSNV/commit/ff44942f5f4e7c4d0e04aaf72bcd4feb1a645afb#ca7d49b27cf92be478d916df4f3b59edf91ff0b5_328_328)

It's not so easy to post the data to reproduce a real use-case as it's a few
Terabytes :)

*

Here's a simple easy code that is incredibly slow in Python:

    
    
        interesting = set(line.strip() for line in open('interesting.txt'))
        total = 0
        for line in open('data.txt'):
            id,val = line.split('\t')
            if id in interesting:
               total += int(val)
    

This is not unlike a lot of code I write, actually.

~~~
proto-n
I've also found that loops with dictionary (or set) lookups are a pain point
in python performance. However, this example strikes me as a pretty-obvious
pandas use-case:

    
    
        import pandas as pd
    
        interesting = set(line.strip() for line in open('interesting.txt'))
        total = 0
        for c in chunks:  # I'm too lazy to actually write the chunking
            df = pd.read_csv('data.txt', sep='\t', skiprows=c.start, nrows=c.length, names=['id', 'val'])
            total += df['val'][df['id'].isin(interesting)].sum()
    

I'm not exactly sure, but _pretty_ sure, that _isin()_ doesn't use Python set
lookups but some kind of internal implementation, and is thus really fast.
I'd be quite surprised if disk IO weren't the bottleneck in the above example.

~~~
luispedrocoelho
`isin` is worse in terms of performance as it does linear iteration of the
array.

Reading in chunks is not bad (and you can just use `chunksize=...` as a
parameter to `read_csv`), but pandas' `read_csv` is not so efficient either.
Furthermore, even replacing `isin` with something like
`df['id'].map(interesting.__contains__)` is still pretty slow.
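
(For reference, a rough sketch of the `chunksize` variant, with the same
hypothetical files as above:)

    
        import pandas as pd
    
        interesting = set(line.strip() for line in open('interesting.txt'))
        total = 0
        # read_csv yields DataFrames of at most `chunksize` rows
        for chunk in pd.read_csv('data.txt', sep='\t', names=['id', 'val'], chunksize=1_000_000):
            total += chunk.loc[chunk['id'].isin(interesting), 'val'].sum()
    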

Btw, deleting `interesting` (when it goes out of scope) might take hours(!)
and there is no way around that. That's a bona fide performance bug.

In my experience, disk IO (even when using network disks) is not the
bottleneck for the above example.

~~~
proto-n
Ok, I said I wasn't sure about the implementation, so I looked it up. In fact
`isin` uses either hash tables or np.in1d (for larger sets, since according to
the pandas authors it is faster after a certain threshold). See
[https://github.com/pandas-dev/pandas/blob/master/pandas/core/algorithms.py#L411](https://github.com/pandas-dev/pandas/blob/master/pandas/core/algorithms.py#L411)

------
ChrisSD
I don't think it's true to say that Python's core developers are uninterested
in performance. Speeding up Python is a hard problem. He mentions PyPy but
even that has only managed modest performance gains in some areas (and not
without tradeoffs). He suggests JavaScript as a comparison but doesn't
elaborate on how they're comparable beyond the superficial (they're both
dynamic scripting languages).

I get that he's frustrated with Python's performance but it would be really
interesting to hear from someone who knows the technology involved rather than
simple speculation.

~~~
Animats
Python's little tin god really likes his CPython implementation being the One
True Python. Python does the things that are easy to do in an interpreter
where everything is a dictionary, and avoids things which are hard to do in
that environment. In Python, you can store into any variable in any thread
from any other thread. You can replace code being executed in another thread.
Even Javascript doesn't let you do that. This functionality is very rarely
used, and makes it really hard to optimize Python.

(And no, calling C whenever you need to go fast is not a solution. Calling C
from Python is risky; you have to maintain all the invariants of the Python
system, manually incrementing and decrementing reference counts, and be very
careful about not assuming things don't change in the data structures you're
looking at. This is not trivial.)

A generation ago, Pascal had the same problem. Wirth had an elegant recursive-
descent compiler that didn't optimize. He insisted it be the One True
Compiler, and managed to get the ISO standard for Pascal to reflect that. The
decline of Pascal followed, although Turbo Pascal for DOS, a much more
powerful dialect, had a good run, and Delphi still lives on.

~~~
orf
> Python's little tin god really likes his CPython implementation being the
> One True Python.

No, he likes it to be the _reference_ implementation, as it both is and should
be. It's simple for a reason.

> This functionality is very rarely used

It's used all the time by debuggers, and the underlying features that _allow_
you to do this are among the most core and intrinsic things in Python.

------
b0rsuk
Is writing extensions a lost art? I read a few blog posts about speeding up
Python and Ruby with Rust extensions. This should enable rewriting only the
slow parts. Later, you could replace more of it if needed. Is writing
extensions so very problematic in practice?

I know Go has runtime issues making it not very good for mixing with other
languages, so it often encourages rewriting the _whole_ application in it.

~~~
jerf
Extensions aren't a total solution, though, which people often sell them as.
You have an impedance mismatch between Python and C code, because Python has
all of its objects packed in a way that is very strange to C, so you end up
essentially deserializing all objects into C, then back out into Python, in a
very expensive and allocation-heavy (on both sides) conversion.

If you can set up your computation in Python and run it in C, as with a lot of
NumPy code, you can have your entire program basically run at C speeds. But if
you have a complicated algorithm in Python, perhaps implementing business
logic, you can very easily see a _slowdown_ if you try to move bits of that
logic into C piecemeal, as you end up paying more in cross-language
serialization and overhead than you can win back.
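
A toy illustration of the boundary cost (illustrative only): the first version
crosses the Python/C boundary once, while the second crosses it per element
and pays the conversion overhead every time.

    
        import numpy as np
    
        xs = np.random.rand(10_000_000)
    
        # One call: the whole loop runs inside NumPy's C code
        fast = np.sqrt(xs).sum()
    
        # Per-element calls: every iteration boxes/unboxes at the boundary
        slow = sum(np.sqrt(x) for x in xs)
    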

In addition, writing extensions can be hard. First you've got a maze of
choices nowadays, and while many of them are quite good at what they do, it
can be difficult to figure out whether you're going to do something they
aren't good at and have to switch options later, and it's really hard to
figure out how to even analyze what they are and are not good at when you're
not already familiar with the space. Then, if you _do_ end up having to delve
into the raw C, it's very tedious code, very tricky code to deal with the
PyObjs, _and_ code that can segfault the interpreter instantly if you don't
get it right, which is not what you want to read about your multi-hour
processing code. And for a greenfield or very young product, this maze of
options is competing against other ecosystems where you can simply implement
your code and get it to run 20-50x faster while writing code that is easier to
write than a Python extension.

They _are_ a solution to some problems. I don't deny this. NumPy is an
existence proof of that statement. But I wrote this post because "if it's
slow, just write the slow bit in C!" has been oversold in the dynamic language
community for at least the 15 years I've been paying attention, and it still
seems to be going strong.

In 2003, it may still have been a good choice; in 2018, my recommendation to
anybody writing _the sort of code where this matters_ is to pick up one of the
several languages that are simply faster to start with, and are much more
convenient (even when statically typed) to work with than the competition was
in 2003. And I also want to say that Python is still good for many things; I
still whip it out every couple of months for something; it's still on my very
short list of best languages. But the ground on which it is the _best choice_
is definitely getting squeezed by a lot of very good competition and the
changing nature of computer hardware, and a wise engineer pays attention to
that and adjusts as needed.

~~~
fafhrd91
A Python extension doesn’t mean C. Rust works perfectly for extensions; it
covers a lot of the low-level C API integration, and it is fast. You can write
the whole application in Rust and use Python as a glue language.

[https://github.com/PyO3/pyo3](https://github.com/PyO3/pyo3)

The PyO3 library gives you the ability to work in both directions: call Python
code from Rust and call Rust code from Python.

~~~
jerf
That sounds like one of the "maze of choices" I mentioned, no?

And if you're "writing the whole application in Rust and using Python as a
glue language", you don't have the problem that this entire discussion is
about, which is when you have _Python_ code that is slow. Python as an
extension language is a completely different world. Performance problems there
are a much less big deal, because you've already got the option to simply use
the fast language with only modestly more complexity, if indeed even that
given how nice Rust is once you get used to it. It's when your whole app is in
Python that these issues emerge, and "Just write extensions" is an option far
less often than portrayed.

~~~
fafhrd91
My point is, you are not limited to using extensions only for optimizing hot
loops; in Rust you can write application logic as well. I doubt you should do
that in C, for example.

------
munro
> At the same time, data keeps getting bigger and computers come with more and
> more cores (which Python cannot easily take advantage of), while single-core
> performance is only slowly getting better. Thus, Python is a worse and worse
> solution, performance-wise.

PySpark makes it really easy to take advantage of multiple cores & machines.
Most operations I want to do to my data I can find in PySpark's
pyspark.sql.functions, so I get all the benefits of the JVM. In the cases
where I need something from Python, I can just use a UDF; it's a little slower
than the JVM but still extremely fast when distributed. I find all problems
come down to time or memory complexity, which is independent of whatever
you're programming in. Also, it's very easy to take advantage of spot
instances with Spark... I'm usually working with 2-20 spot instances, and
sometimes go up to 60 depending on what I'm doing.
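
A rough sketch of what that looks like (hypothetical file and column names):
built-in functions stay on the JVM, and a UDF is the per-row Python escape
hatch.

    
        from pyspark.sql import SparkSession, functions as F
    
        spark = SparkSession.builder.getOrCreate()
        df = spark.read.csv('data.tsv', sep='\t').toDF('id', 'val')
    
        # JVM-side aggregation, no per-row Python overhead
        total = df.agg(F.sum(df['val'].cast('double'))).first()[0]
    
        # Python UDF: slower per row, but still runs distributed
        normalize = F.udf(lambda s: s.strip().lower())  # default return type is string
        df = df.withColumn('id_norm', normalize('id'))
    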

~~~
pcx
This is true, but using a distributed system like Spark itself adds a ton of
complexity in having to understand and manage it. If one can do something with
a set of stateless processes, I feel it's a bad idea to use a distributed
system instead, even if it's more performant. Not always, but in a good
majority of the cases I've seen. I've seen projects where Celery would have
been enough, but instead they chose Spark/Storm and never delivered.

------
banku_brougham
I’d like to thank the author for sharing a very practical view of problem
solving in the data science space.

Can I suggest Julia? It's very easy to understand coming from Python, and
performant code can usually be had in an easy-to-read implementation of the
expressions in whatever paper you are basing your work upon.

~~~
loeg
Anytime someone brings up Julia, I think of Dan Luu's review of the language:
[https://danluu.com/julialang/](https://danluu.com/julialang/)

~~~
shele
The way Dan Luu's post is used on HN resembles a thought-terminating cliché.

~~~
loeg
Dan is thorough, and I trust him to make a good faith effort to understand
things. If you'd like to refute the arguments and not the messenger, I would
love to learn more.

~~~
simonbyrne
The post is quite old: while the technical arguments certainly had merit at
the time, they have largely been addressed (the exception is probably error
handling, but his complaint there is more subjective, and I still don't think
any language really has a good answer for that one).

As to the community, I'm not exactly sure what happened with Dan (he only has
a handful of posts on GitHub and mailing lists, so it seems to have been
largely in private emails), but my experience could not have been more
different: even from the early days they have been very friendly and helpful.

(disclaimer: I now work for Julia Computing)

------
carlmr
> _I used to make this argument. Some of it is just a form of utilitarian
> programming: having a program that runs 1 minute faster but takes 50 extra
> hours to write is not worth it unless you run it >3000 times. For code that
> is written as part of data analysis, this is rarely the case._

I find this argument breaks down if you consider human psychology, especially
with a program taking 15 seconds versus 30 minutes (which is a realistic gap
between, say, a C++/Rust implementation and a Python implementation in some
cases I've experienced).

With 15 seconds of exec time you might stay in flow. With 30 minutes you're
almost guaranteed to have started something else. Maybe you even forget and
only get back to it the next day. All of a sudden your 30-minute delay becomes
a day. Then you notice you made some wrong inputs, and you lose another day.
In the other case you're still under a minute.

I find small increases in program delay often lead to big increases in time
inefficiency. It's hard to constantly context switch in and out of tasks.

I think the utilitarian argument should take human psychology into account and
weight more heavily towards faster programs.

------
hyperion2010
I recently discovered that pypy3 can run all my day to day Python code. It has
some issues with slightly different behavior from cpython when using threads
but other than that I see a 4x speedup on most of my slowest pure python
workloads (parsing large rdf files and reserializing them after computing a
total order on all their nodes). Huge win for productivity.

~~~
civility
> I see a 4x speedup on most of my slowest pure python workloads

Heh, only 25..250X to go. We did a direct line-for-line translation of some
numerically intensive code from Python to C++ and saw a literal 1000X speedup.
On other projects, it's been more like 100X. That says two things: first,
Python can be really slow; second, for some programs, Python doesn't really
save on lines of code over modern C++.

I've been very impressed with PyPy however. In testing, it can sometimes sneak
up to less than a factor of 2 slower than C. However, the bummer comes when it
doesn't hit that mark and you have no idea how to trick the JIT to do better.
If it works, great. If it doesn't, you don't have much insight into why.

Finally, I've always been able to get Cython to parity with C++. However, when
I'm done, I wonder what I gained. The C++ isn't that much more complicated
than adequately type annotated Cython.

~~~
joshuamorton
While I'm more of a Pythonist than a C-ist, hearing "1000x speedup" and "line
for line" together implies to me that you weren't writing idiomatic Python.
Idiomatic Python is (often) faster than the alternative, and (often) more
difficult to translate to lower-level languages.

As a simple example, list comprehensions are faster than explicit loops, and
can't be translated line for line into C++.
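
For example (a trivial case): the comprehension is both the idiomatic form and
the faster one, since it runs a specialized bytecode loop instead of repeated
method calls.

    
        # Explicit loop: looks up and calls squares.append on every iteration
        squares = []
        for i in range(1000):
            squares.append(i * i)
    
        # Idiomatic and faster: one specialized LIST_APPEND bytecode per element
        squares = [i * i for i in range(1000)]
    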

~~~
pjmlp
Depends, I bet they can be easily translated to some LINQ like implementation
in C++17.

List comprehensions are just syntax sugar for map/filter/fold.

~~~
joshuamorton
A jaunt through SO didn't give anything reasonable.

This is what SO had to offer:
[https://stackoverflow.com/questions/36339533/how-to-generate-vector-like-list-comprehension/36340883](https://stackoverflow.com/questions/36339533/how-to-generate-vector-like-list-comprehension/36340883),
and a quick check of new C++17 features didn't show any that would obviously
improve on that.

There's nothing implicitly stopping you from doing some macro magic to
implement it, but it's not there naturally.

~~~
pjmlp
SO is not the only source of truth.

    
    
        #include "cpplinq.hpp"
    
        int computes_a_sum ()
        {
            using namespace cpplinq;
            int ints[] = {3,1,4,1,5,9,2,6,5,4};
    
            auto result =    from_array (ints)
                          >> where ([](int i) {return i%2 ==0;})  // Keep only even numbers
                          >> sum ()                               // Sum remaining numbers
                          ;
            return result;
        }
    

Taken from
[https://archive.codeplex.com/?p=cpplinq](https://archive.codeplex.com/?p=cpplinq)

Done with C++11. I only referred to C++17 because of the lambda improvements
made since they were introduced in C++11.

------
pletnes
There’s a project to plug different JIT compilers into CPython, so there’s
hope.
[https://github.com/Microsoft/Pyjion/blob/master/README.md](https://github.com/Microsoft/Pyjion/blob/master/README.md)

Also, I’ve more than once seen CPython beat C++/Fortran, since it’s easier to
do the right algorithm/datastructure things, plus numpy is more optimized than
most «amateur» C loops over arrays.

That being said, faster python is always welcome.

~~~
gh02t
Honestly, NumPy is going to be hard to beat, even for someone knowledgeable,
in certain use cases, especially ones where the overhead in Python is trumped
by time spent in library calls. It's the same reason that it's hard to beat
MATLAB or Mathematica in the cases they are optimized for, despite their being
relatively slow languages. They are calling some of the most heavily optimized
libraries in existence (e.g., BLAS) and using heuristics to help choose the
smartest evaluation strategy.

Edit: More speed on the Python side is good though, because it gives you
flexibility. Sometimes it's hard to figure out how to do stuff optimally in
NumPy, versus just banging things out in a for loop. I've definitely done that
when I wanted something to just work, versus spending an hour figuring out
what arcane incantation I need to pass to np.einsum to get the operation I
want.
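
To make that concrete, a toy example: the same contraction written as the
einsum "incantation" and as the for loop it replaces.

    
        import numpy as np
    
        A = np.random.rand(100, 50)
        B = np.random.rand(50, 80)
    
        # The arcane-but-fast incantation (here just a matrix product)
        C1 = np.einsum('ij,jk->ik', A, B)
    
        # The loop that "just works", but is orders of magnitude slower in pure Python
        C2 = np.zeros((100, 80))
        for i in range(100):
            for j in range(50):
                for k in range(80):
                    C2[i, k] += A[i, j] * B[j, k]
    
        assert np.allclose(C1, C2)
    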

~~~
luispedrocoelho
"spending an hour figuring out what arcane incantation I need to pass to
np.einsum to get the operation I want"

Yes, I have also had this experience and I hate how in the end, the code is
very hard to read, while the for loop was probably trivial.

~~~
cjalmeida
Every time I thought I needed einsum or similar arcane ops, I found that a
Numba-optimized for loop did the job.

~~~
gh02t
Numba is nice, but it's another large dependency to pull in. If I can avoid it
I will.

------
hidenotslide
I don't find this a very compelling argument. The author doesn't mention any
attempts to profile or speed up the code.

Specifically with pandas I've found if you aren't careful you can do a lot of
unnecessary copying. Not sure if that's what is going on here, but cProfile
can help find the bottlenecks.

~~~
joshuamorton
Seconding this, there are a couple of things that jump out at me as
immediately non-optimal, and which together would probably give an order of
magnitude speedup.

- Defining compute_diversity inside a double for loop

- `sample1.ix[sample1.index[sample1.index.duplicated()]]` appears
overengineered (I think you can just remove the `sample1.index` here (edit:
you can't, but I think you could refactor to remove the indexing and
reindexing and index resetting, and then you could))

- Depending on the data size, swapping from `[` to `(` everywhere (i.e. list
comprehensions to lazy generator expressions) would give a nice speedup just
because you no longer need to store everything in memory/swap to disk, whereas
in Haskell the list comprehensions would be lazy by default. (edit: seeing as
the databases downloaded are 12 and 33 GB, and Pandas generally requires 2-3X
the data size in RAM, it's likely that there's swapping happening somewhere.
I'd bet that using generators would be a big speed boost)

- Overall I think genetic_distance can be significantly simplified; a lot of
the index-massaging doesn't look necessary. I could be wrong, but this looks
sloppy, and sloppy often implies slower than necessary.

Unfortunately, the provided data files are big enough that I can't easily
benchmark on my computer. I can't even fit the dataset in memory!

~~~
luispedrocoelho
You are commenting on the variant of the code that is fast enough that it
doesn't matter.

~~~
joshuamorton
While that may be true, my point is that it is almost certainly possible to
make your code go faster than it is already, and become more readable in the
process.

So saying that Python is either slow or ugly and unreadable is perhaps an
unfair characterization. I _may_ be wrong here. I haven't benchmarked the code
in question, but I think that even for the algorithm you're implementing, with
the special casing, that function could be significantly simplified.

Edit: I'd be curious to see example data that is passed into this function.

~~~
luispedrocoelho
That may be the case. However, my point is that we started with a rather
direct implementation of a formula in a paper. This was very easy to write but
took hours on a test set (which we could extrapolate to taking weeks on real
data!).

Then, I spent a few hours and ended up with that ugly code that now takes a
few seconds (and is dominated by the whole analysis taking several minutes, so
it would not be worth it even if you could potentially make this function take
zero time).

Maybe with a few more hours, I could get both readability and speed, but that
is not worth it (at this moment, at least).

*

The comment about the benchmark data being large is exactly my point: as
datasets are growing faster than CPU speed, low-level performance matters more
than it did a few years ago (at least if you are working, as I am, with these
large data).

~~~
joshuamorton
Right, and my point is that you could probably

1. Have gotten similar performance boosts elsewhere, meaning that you
wouldn't have needed to refactor this function in the first place (although
the implication of a 10000x speedup means that may not be true, I can
absolutely see the potential for 100x speedups in this code, depending on
exactly what the input data is)

2. It's likely that there are much more natural, idiomatic ways to implement
the function you have in pandas. These would be both clearer and likely
equally fast, possibly faster. (Heck, there are even ways to refactor the code
you have to make it look a lot like the direct-from-the-paper implementation.)

In other words, this isn't (necessarily) a case of Python having weak
performance; it's a case of unidiomatic Python having weak performance. That
is true in any language, though. You can write unidiomatic code in any
language, and more often than not it will be slower than a similar idiomatic
approach (repeatedly applying `foldl` in Haskell, say). I'm not enough of an
expert in pandas multi-level indexes to say so for certain, but I'd bet there
are more efficient ways to do what you're doing from within pandas that look
a lot less ugly and run similarly fast.

Granted, there's an argument to be made that the idiomatic way should be more
obvious. But "uncommon pandas indexing tools should be more discoverable" is
not the same as "Python is unworkably slow".

~~~
luispedrocoelho
1. No, that function was the bottleneck, by far, and I can tell you
that >10,000x was what we got between the initial version and the final one.

2. I don't care about faster at this point. The function is fast enough.
Maybe there is some magic incantation of pandas that will be readable and
compute the same values, but I will believe it when I see it. What I thought
was more idiomatic was much slower.

I think this is more of a case of "the problem does not fit numpy/pandas'
structure (because of how the duplicated indices need to be handled), so you
end up with ugly code."

~~~
joshuamorton
1. You don't get 10000x speedups by changing languages. It's likely that this
optimization would have been necessary in any case.

2. You don't care about improving the code, but you did care enough to write
an article saying that the language didn't fit your needs, without actually
doing the due diligence to check whether it did. That's the part that gets
me.

------
Rotareti
I use a lot of Python for web stuff and I haven't been in a situation where
Python itself was the performance bottleneck. I always thought that when you
run into such a situation, you replace the critical bits with something like
C/C++/Rust. Following this approach, you get the best of both worlds: rapid
_proof-of-concept_ / _time-to-market_ with the option to improve performance-
critical parts later (which often isn't necessary). Could anybody share some
experience with this?

~~~
Erwin
It requires that your code is architected so performance-critical sections of
Python can move into C etc. Let's say that your code creates a complex object
tree from some configuration settings, and executes Python methods and code
from all over it, using heavy OO. That is difficult to move to C++, as your
performance is spent on Python bookkeeping: you are calling methods and thus
looking things up in dictionaries, you are modifying fields and also looking
up more things in dictionaries, incrementing and decrementing refcounts, etc.

If you have a million 32-bit numbers that you currently run Python code on,
great, you don't have to convert Python objects to C at all.

~~~
Rotareti
Thanks for the insights!

 _> Let's say that your code creates a complex object tree from some
configuration settings, and executes Python methods and code from all over it,
using heavy OO._

Luckily, this does not apply to the codebases I'm working on, which are all
quite _functional_ (no classes, no inheritance, pure functions exclusively,
immutable data types, etc.), so I have the feeling that this will not hit me
that hard. If you rely on pure functions, all the application state the
function needs is in its parameters, and you pass all new state back through
`return`. I guess all I'd have to do is convert the types once for the C
function call (from Python to C) and once for the `return` (from C to
Python)?

------
marmaduke
His example of an unreadable function is pretty typical. It still might be
slower than a tight loop in C, but it's only unreadable the first time you
write something like that.

That said, Numba would be a natural tool here.

~~~
klibertp
I'll buy you a beer if you can derive the original formula from this code.

~~~
marmaduke
I admit I recognized only some general NumPy things like masks, unique,
reductions, outer, etc.; I don't use Pandas and I'm not sure what the
non-NumPy stuff does.

I still don't think it's more obscure than equivalent for loops or FP folds or
similar.

------
fermigier
The go-to solution for speeding up Python code should always be to first use
Cython on the critical sections of your Python code and tweak it using type
annotations, at least IMHO.

~~~
hoschicz
Do type annotations really make any difference to the interpreter? I thought
that the interpreter doesn't care about what type a variable is annotated
to...

~~~
dragonwriter
> Do type annotations really make any difference to the interpreter?

Note the advice was _first_ use Cython (a compiler for a superset of Python),
and _then_ tweak as needed with type annotations. Cython’s compiler definitely
uses type annotations.

------
otorrillas
OP, I would encourage you to take some courses on high-performance computing
and, especially, on architecture awareness in programming. These types of
courses will help you increase the performance of your programs by being aware
of what's running "under the hood" and of ways to "help" the
compiler/interpreter make better optimisations.

Although it's accurate to say that Haskell or C++ are faster than Python,
having had a quick look at the examples you posted around here, I believe
there's still a lot of room to improve (performance-wise) in your Python code
that could bring a significant speedup.

However, bear in mind that you shouldn't expect Python to get close to C++
performance unless you start using libraries such as NumPy that are,
essentially, written in C/C++.

------
ggm
To me, the critical quality exposed is the abstraction/synthesis moment of
Haskell/types/FP thinking. Python is what I use, but I rely on insights from a
Haskell person to get solutions of merit. Left to my own devices I frequently
arrive at Python solutions with bad scaling, few and weak opportunities for
parallelism, and heaps of errors.

When driven to think in types and simple function composition, the solutions
seem to run better.

------
weberc2
Let me disclaim by saying I like Python, and I've used it for a decade and it
pays my bills.

The author claims that Python has the lowest developer cost. I used to think
that was true, and maybe it is in data science applications, but I regularly
find that I'm quite a bit more productive in Go than in Python (largely thanks
to the type checker and other static analysis tooling). As an added bonus, Go
programs are regularly 100 times faster than Python programs, and usually
Python programs are much more difficult to optimize than Go programs.

Library availability notwithstanding, starting new projects (of any
significance at all) in Python is looking like a worse and worse choice all
the time.

------
deathanatos
I agree with some of this, such as,

> _Python, it is slow as molasses. I don’t mean slower in the sense of “wait a
> couple of seconds”, I mean “wait several hours instead of 2 minutes.”_

Python can be multiple orders of magnitude slower than the equivalent in
C/C++/Rust, but,

> _more cores (which Python cannot easily take advantage of)_

Python's multiprocessing makes launching new processes (which _can_ take
advantage of more cores) pretty much as easy as launching threads.
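
For example, a minimal sketch (toy workload): each call to `crunch` runs in a
separate process, so the work spreads across cores despite the GIL.

    
        from multiprocessing import Pool
    
        def crunch(n):
            return sum(i * i for i in range(n))
    
        if __name__ == '__main__':
            with Pool() as pool:  # defaults to one worker per core
                print(sum(pool.map(crunch, [10**6] * 8)))
    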

But as a developer, I _am_ frustrated by a lot of the things people believe
are options. They are options, but … they're hard to use, and hard to take
advantage of.

* Writing a Python extension requires dropping down to C, which has so many foot-guns, I'd like to delay doing so as long as absolutely possible. Even then, you might not be able to win back that much performance, if most of your time is spent manipulating Python objects. Programmers, in my experience, also _vastly_ overestimate their ability to write correct C.

* Cython can compile Python to "C", but in Python 2 (which I am alas stuck with, despite my will; someday…), has a bug that miscompiles code dealing with metaclasses. Worse still, the latest version of six will trigger this bug. (The Cython developers do not consider this — Cython's compiled version of code behaving differently from Python and CPython — a bug.)

* rust-cpython [1] is theoretically great, but has a bug on OS X that causes aborts (it erroneously links against the Python binary, I think, and this causes issues w/ virtual environments, where a different binary ends up getting used. I don't think this affects Linux, but I need to support OS X.)

(Throw in the enormous amount of time that I spend debugging `object of type
NoneType has no attribute "static_typing"`, and the amount of time that I
spend wondering "what type is this variable supposed to be?" and working it
out by reverse-engineering the code, and I honestly wonder if Python is
actually "faster".)

[1]: [https://github.com/dgrunwald/rust-cpython/issues/59](https://github.com/dgrunwald/rust-cpython/issues/59)

------
Insanity
Does this not come down to using the right tool for the job?

I love writing in Python but would not use it for something where performance
matters.

~~~
luispedrocoelho
I think this is part of my argument: as datasets grow faster than single-core
CPU speed, performance matters more and more.

------
donarb
At PyCon US last year, Intel had a booth promoting their version of Python and
data libraries optimized for Intel processors. Wonder if any of this would
have helped.

[https://software.intel.com/en-us/distribution-for-python](https://software.intel.com/en-us/distribution-for-python)

~~~
ddorian43
I don't think they made raw Python code faster (like PyPy does); rather, they
bundle some libraries and make those faster (e.g. "if the CPU is Intel, enable
optimizations" work they've done elsewhere).

------
Annatar
_Update: Here is a “fun” Python performance bug that I ran into the other day:
deleting a set of 1 billion strings takes >12 hours._

Buddy, it’s time for AWK.

------
coleifer
Sad not to see Cython getting a mention in this post. It is a superset of
Python, so your vanilla code will run just fine, and you can optimize the slow
parts to native speed. It's a fantastic tool.

------
ivanb
I have the impression that there are features in Python that add very little
programmer productivity but make the language slow. It should be possible to
implement a hypothetical FastPython without such features but with great
performance gains. Of course it wouldn't be compatible with most of the
libraries. I can imagine, though, that porting most of the libraries to
FastPython would still be a manageable task. I wonder if such projects have
been attempted.

~~~
benou
You mean like Cython [1] or RPython [2] from PyPy [3]?

[1] [http://cython.org/](http://cython.org/)

[2] [https://rpython.readthedocs.io/en/latest/rpython.html](https://rpython.readthedocs.io/en/latest/rpython.html)

[3] [http://pypy.org/](http://pypy.org/)

~~~
ivanb
Yes, RPython looks close to what I'm talking about.

------
AllegedAlec
Interesting article. I mostly agree.

OP, could I ask a question?

You mention 1TB files. Why do you guys at EMBL not use a database for this
sort of stuff? I'd figure that with some proper indexing you could see pretty
decent speedups just from that already.

~~~
jerven
Not OP, but I work on a downstream project, and my current boss used to work
on EMBL-bank back in the day. A lot of this stuff is in databases, e.g.
Oracle, and I think for advanced search it was in Teradata.

However, databases are hard to share, so many steps require dumping the
database into some interchange format (custom, and often from before the age
of XML or JSON; yay for ASN.1 parsing!).

Sharing database dumps is done, but commercial licenses and version mismatches
do add issues here as well. Remember, EMBL/ENA is older than MySQL. The
databases tend to have the wrong shape for the next downstream step, i.e.
table design is related to workflow, and if your next step is completely
different we end up with issues. Also, some data can't be published until a
certain date, so that needs to be filtered from the dumps in some way.

Consider as well that this project is 3 decades old and used to be printed in
books at some point, and shipped on DVD as recently as 2004. File-based
operations can be extremely efficient.

------
__s
They should use the Y shortcut, available on GitLab & GitHub, to convert
their links to mutable master into links to the immutable commit.

------
cturner
Where are the main blowouts in python performance? For example, is it
compilation, evaluation overhead, or memory management?

~~~
gergo_barany
_> Where are the main blowouts in python performance?_

I did some research a few years ago that tried to quantify some of this. If
you trust my methodology, the biggest problems (depending on application, of
course) are: boxing of numbers; list/array indexing with boxed numbers and
bounds checking; and late binding of method calls. Basically, doing arithmetic
on lists of numbers in pure Python is about the worst thing you can do.

And it's not just due to dynamic typing: Even if you know that two numbers you
want to add are floats, they are still floats stored in boxed form as objects
in memory, and you have to go fetch them and allocate a new heap object for
the result.
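
A quick way to see that cost (a toy benchmark; numbers will vary): the list
holds a million separately heap-allocated float objects, while the array holds
the same values unboxed and contiguous.

    
        import timeit
        import numpy as np
    
        xs = [float(i) for i in range(1_000_000)]    # boxed floats
        ys = np.arange(1_000_000, dtype=np.float64)  # unboxed, contiguous
    
        print(timeit.timeit(lambda: sum(xs), number=10))   # walks boxed objects
        print(timeit.timeit(lambda: ys.sum(), number=10))  # tight C loop
    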

The basic idea of my study was as follows: Compile Python code to "faithful"
machine code that preserves all the operations the interpreter has to do:
dynamic lookups of all operations, unboxing of numbers, reference counting.
Then also compile machine code that eliminates some of these operations by
using type information or simple program analysis. Compare the execution time
of the different versions; the difference should be a measure of the costs of
the operations you optimized away. This is not optimal because there is no way
to account for second-order effects due to caching and such. But it was a fun
thing to do.

The paper, with data for a set of benchmarks, is here:
[http://www.complang.tuwien.ac.at/gergo/papers/dyla14.pdf](http://www.complang.tuwien.ac.at/gergo/papers/dyla14.pdf)

As for how to improve this, I think Stefan Brunthaler did the most, and the
most successful, work on purely interpretative optimizations for Python. Here
is one paper that claims speedups between 1.5x and 4x on some standard
microbenchmarks:
[https://arxiv.org/abs/1310.2300](https://arxiv.org/abs/1310.2300)

Basically, you _can_ apply some standard interpreter/JIT optimization
techniques like superinstructions or inline caching to Python. But these
things are hard to do, they won't matter for most Python applications, and
come with a _lot_ of complications.

~~~
cturner
I have not yet done detailed study, but your paper appears to be a fabulous
resource. The context from your post is high-value also. Thanks.

------
mschaef
Great article. The way I look at it is something like the swordsman scene in
Raiders of the Lost Ark: the swordsman's doing fancy optimizations, and C just
blows away the need for them.

If you can find a language with 100x the performance of an interpreted
language, that speed delta will cover up a lot of naivety in the code you
write.

------
arithma
Maybe there's merit in a Python version that compiles to JS, where we allow
engines like V8 to do the optimisation. The dynamic nature of JS may be enough
of an impedance match to allow this to happen?

Writing this up, I'll probably get someone suggesting that this is already a
reality with some tool. Glad to be taught.

------
tworc
Numba makes Python highly performant, and is simple to use. Definitely worth
considering.

------
lpmay
I don't see it mentioned here, but the dask library looks like a promising
solution. It has ways to handle these kinds of large datasets, and to
efficiently schedule computations that don't fit the numpy model. Worth a
look.

------
knicholes
When I was taking a python class in school, the professor did something to
generate C code from the Python code, and it gave something like a 60%
speedup.

~~~
repsilat
Probably Cython. Sometimes it helps, sometimes it doesn't work (doesn't like
generator comprehensions iirc) but mostly it provides a "sliding scale" into C
or C++ land -- after the first compile, you can start littering type
declarations around the code and you can stop whenever you hit the speed you
want. It has been a good solution for me in the past, but probably only
because I started with a Python codebase. If I were starting afresh I wouldn't
bother.

------
stesch
Don't forget marketing.

"We need to switch from Python 2 to Python 3!" – "Yeah! We do! Faster running
programs!!" – "Well, actually …"

------
a3n
CPython is obviously the way it is on purpose, by design, and quite
successful. Yet I still find it ironic that this "slow" interpreter is written
in C, the go to, general purpose, "low level and a half" fast language, and
that "C" is right there in the name. Not knowing anything else, I might expect
a project with "C" at the front of the name would be at least fast-ish, and
that the naming was intended to signal that.

Not a complaint, just an observation.

~~~
wolf550e
What? No. The C in CPython was not meant to signal performance. The C was
added after alternative implementations were created, to mean "the
original/reference implementation". Python's competitors Perl, Ruby, and PHP
are likewise interpreters implemented in C and are equally slow.

Anyway, what alternative did they have for implementing a bytecode
interpreter? C is portable and fast enough, especially with the computed-goto
extension.

~~~
a3n
> The c in cpython was not meant to signal performance.

I thought it was clear that I understood that: "Not knowing anything else, I
_might_ expect ..." I guess not.

------
dsign
Interesting article, and the comments.... going to research that one about
Numpy with MKL....

------
pcarolan
This is why God invented Spark.

------
WillReplyfFood
The problem here is that most languages are tailored to easy learning by
humans and not to strong performance. If you had a language tailored to strong
performance, it would force you to bundle together seemingly unrelated data
into structures used in the main hot codepath of the core algorithm.

It would then force you to specify data delivery routes and processing, and
would craft a cache-optimal process loop dependent on the architecture used.
The result would seem like an enormous while loop that takes a seemingly
arbitrary number of for loops to preprocess data, glued together in strange
overlapping unions, to shove the end result to the main process algorithm.

This would be the most optimal result for a processor, but to write a language
to describe this (and compilers to implement it): the horror.

~~~
walterbell
Have you looked at [http://terralang.org](http://terralang.org)?

------
indubitable
Something I often wonder in these sorts of discussions is why C# is generally
omitted. Its performance is comparable to C++, with none of the trappings. It
also does an excellent job of integrating some of the most useful features of
functional programming into an imperative language. And multi-processor
programming with the language is also incredibly simple.

But I think the best part is the programmer time. An anecdote I find endlessly
entertaining: on another forum I shared some code to solve a problem people
were having an issue with, and it was assumed my code was pseudo-code. It was
correct, compilable C#. And they're constantly adding incredibly useful
features. For instance, a recent addition is more expressive tuples:

    
    
        (int number, string s, char c) triple = (2, "two", '2');
        triple.number = 13;
    

And lambda functionality is similarly clean. A lambda value might be:

    
    
        x => 2*x + 5;
    

Equivalently, as an anonymous method:

    
    
        delegate(double x){return 2*x + 5;}
    

And an example of how simple arbitrary processor count parallel programming
can be (using lambda syntax as above):

    
    
        Parallel.ForEach(listOfThings, thing => DoSomething(thing));
    
    

Yet as is typical in scenarios like this one, the author sees the decision as
being between the opposite extremes of C++ and Python. The only downsides of
the language I've run into are the lack of some shoot-yourself-in-the-foot
features of C++, like multiple inheritance, and the fact that template
specialization is awkward. Concerns about garbage collection are vastly
overblown. My main work is with projects that have in-memory collections
gigabytes in size, and you'd think the collector would be a huge issue, yet
it's mostly transparent and can be controlled if necessary (which, in the vast
majority of cases, it is not).

~~~
pletnes
The author is a scientist analyzing his data. I never met anyone in that crowd
using C#. Are there even any good data science/numerics libs out there? C++
has a lot of number crunching libs, python even more.

~~~
pjmlp
Yes there are, IMSL and AleaGPU for starters.

[https://www.roguewave.com/products-services/imsl-numerical-libraries/net-libraries](https://www.roguewave.com/products-services/imsl-numerical-libraries/net-libraries)

[http://www.quantalea.com/](http://www.quantalea.com/)

.NET is used a lot for data analysis in life sciences, as most laboratory
devices only expose COM and .dll interfaces.

Especially for stuff like DNA sequencing, chemical structure manipulation,
reaction curve analysis and such.

