
Why I Still Use Python for High Performance Scientific Computing - subnaught
http://nbviewer.jupyter.org/github/lmcinnes/hdbscan/blob/master/notebooks/Python%20vs%20Java.ipynb
======
msellout
Summary in the conclusion:

"The end result is an implementation several orders of magnitude faster than
the current reference implementation in Java. ... [Python] makes the first
version easy to implement and provides plenty of powerful tools for
optimization later when you understand where and how you need it." [edited to
be a statement instead of rhetorical question]

~~~
cygnus_a
Actually, even the subsection headings in bold give a very succinct summary:

- Python has easy development
([https://xkcd.com/353/](https://xkcd.com/353/))

- Great libraries (i.e., free Matlab)

- Cython for efficiency via C

- The algorithms themselves determine speediness (i.e., numerical methods)

~~~
dagw
_The algorithms themselves determine speediness_

This is so important I wish people would focus more on it. I recently rewrote
some Javascript code in (pure) python and got a good 2 orders of magnitude
speed up on large inputs just by picking the right data structures and
replacing an O(n^3) nested loop with an O(n log n) approach.

~~~
timClicks
Which data structures did you use? In Python, I tend to rely on dict, list,
and set for 90% or more of my code. I wouldn't want to rely on structures
written in pure Python.

~~~
dagw
Nothing exotic. One of the changes, for example, was replacing a list of lists
with a set of tuples, which greatly sped up checking whether an object was in
the collection. Another was using a generator expression and a built-in
itertools function rather than hand-rolled nested for loops.
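Those changes are easy to sketch (illustrative data, not the original code):

```python
import itertools

# Membership tests: a list of lists is a linear scan per lookup;
# a set of (hashable) tuples is an average O(1) hash lookup.
pairs_list = [[1, 2], [3, 4], [5, 6]]
pairs_set = {tuple(p) for p in pairs_list}

assert [3, 4] in pairs_list   # O(n) scan
assert (3, 4) in pairs_set    # O(1) lookup

# Hand-rolled nested loops over all unordered pairs...
items = ["a", "b", "c", "d"]
combos = []
for i in range(len(items)):
    for j in range(i + 1, len(items)):
        combos.append((items[i], items[j]))

# ...collapse to a single itertools call:
assert combos == list(itertools.combinations(items, 2))
```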

~~~
echlebek
Once you've exhausted all the low-hanging fruit, like people calling .keys()
on dicts, or doing unnecessary linear searches, Cython really starts to shine.
I've seen it perform ~40 times better than pure Python in time-consuming
loops.
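For instance, the `.keys()` fruit (this matters most on Python 2, where `dict.keys()` builds a fresh list):

```python
d = {i: i * i for i in range(100000)}

# Python 2 anti-pattern: `500 in d.keys()` builds a list and scans it.
# `500 in d` is a hash lookup in both Python 2 and 3.
assert 500 in d

# Iteration doesn't need .keys() either:
total = 0
for k in d:          # not: for k in d.keys()
    total += k
```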

We do scientific computing at my company. Numpy does 90% of the work, but
there are some algorithms that just aren't easily expressed with arrays.
That's where Cython comes in.

~~~
IanCal
> Numpy does 90% of the work

Numpy and scipy have been the core of a huge amount of my optimisations. The
first question I try and ask is

"Could this be solved with matrix multiplications and summing?"

Often the answer is "yes", which allows you to group a huge number of
calculations together and use the heavily optimised code available in
numpy/scipy.

I recently swapped out something that was running at about 100 rows
calculated/second to about half a million in about 0.2s.
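A sketch of that kind of rewrite (hypothetical computation, illustrative numbers, nothing from the original code):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((500000, 8))   # the "rows" to calculate
w = rng.standard_normal(8)

# Row-at-a-time: one Python-level loop iteration per row.
def per_row(X, w):
    out = np.empty(len(X))
    for i, row in enumerate(X):
        out[i] = (row * w).sum()
    return out

# Grouped: one matrix-vector product covers every row at once,
# dispatching to numpy's optimised C/BLAS code.
def grouped(X, w):
    return X @ w

assert np.allclose(per_row(X[:100], w), grouped(X[:100], w))
```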

~~~
corysama
[http://www.vetta.org/2008/05/scipy-the-embarrassing-way-to-code/](http://www.vetta.org/2008/05/scipy-the-embarrassing-way-to-code/) ;)

~~~
IanCal
Fantastic!

------
lmm
But you can have both. In Scala I can write prototypes just as rapidly as
Python, but I can run them with close-to-native performance. I can even
explore interactively in a REPL but backed by the power of my company's big
computer cluster, using spark-shell. The profiling capabilities are excellent,
but when I spot a bottleneck I can solve it in the language directly, without
needing the awkwardness of cython or of converting to/from numpy formats.

(And while I personally love Scala, there's nothing magic about it in this
regard. There's no reason a language can't offer Python-like expressiveness
and Java-like performance, and many modern languages do)

~~~
srean
Except for the fact that JNI is such a piece of utter... garbage ... as far as
performance is concerned. One has to think twice, thrice ...countless times
before jumping back and forth over the runtime bridge of JVM and native. Not
that Python is that good at it either, but better than JNI, almost everything
is. The best I have seen is Lua's FFI.

I consider it an accident of history that Numpy, Scipy, Pandas, Scikits got
written for Python and not Lua. Thanks to Luajit I think Lua would have been a
better choice. Now that cause has been taken up by Torch and Julia.

JVM by itself is terrible for reaching close to the FLOPS that the CPU is
capable of. Try sparse matrix multiply with it and see it for yourself.

~~~
eonwe
Could you elaborate a bit on how JVM hinders reaching maximum FLOPS when
multiplying sparse matrices?

I could think some examples where lacking SSE/AVX support would hinder it, but
I don't see the connection with sparse matrices.

~~~
srean
What makes sparse matrices harder to JIT is proving that the loop bounds will
not be exceeded, so all accesses are bounds-checked. In any case, in my
experience array bounds checking and escape analysis never gave the boost that
theory and JVM fans promise. So even normal matrix multiply will trail behind.
That said, the HotSpot JVM is possibly one of the most optimized VMs we have.

A structural problem with the JVM is that its runtime semantics are
over-specified, leaving very little room for the JIT to do its stuff. For
example, function arguments are evaluated in a fixed left-to-right order;
there goes an opportunity for parallelism.

~~~
eonwe
A normal dense matrix laid out as a double[] and accessed directly as
i * N_ROW + j probably won't get its bounds check elided. For double[][] I
would _think_ that happens more easily.

But how are sparse matrices then generally laid out? A naive approach would be
some hash map, perhaps with some locality, in which case I don't see JIT problems.

~~~
srean
There are some de facto standard formats such as CRS, CSC, list of tuples,
etc. The layout of the third should be obvious, and it is not used much where
speed matters because it loses locality. For the other two, the non-zero
values are laid out column after column (or row by row), along with their row
(column) ids and offsets indicating where each column (row) starts and ends.
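scipy.sparse exposes those arrays directly; a small CSR (row-by-row) example:

```python
import numpy as np
from scipy.sparse import csr_matrix

M = np.array([[10,  0,  0],
              [ 0,  0, 20],
              [30,  0, 40]])
S = csr_matrix(M)

# CSR stores three flat arrays: the non-zero values, their column
# indices, and per-row offsets into both.
# Row i occupies data[indptr[i]:indptr[i+1]].
assert S.data.tolist()    == [10, 20, 30, 40]
assert S.indices.tolist() == [0, 2, 0, 2]
assert S.indptr.tolist()  == [0, 1, 2, 4]
```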

~~~
eonwe
Thanks, this helped. Using CSC and CRS is probably quite problematic with JVM
bounds check elimination (or the lack of it). So if they would need to be used
on JVM, I think it would be wise to drop the safety and use _sun.misc.Unsafe_
for unchecked array access.

------
kitd
Large-scale data processing jobs normally arrange themselves into data
acquisition/cleaning, grunt numerical work, and result formatting/display.
These tasks have very different requirements, so a combination of a tool that
can do all the data handling easily (ie Python) + a tool that can throw the
CPU at a numerical problem (ie C) will work as a great combination.

In contrast, if you work in Java, you are trying to use the same tool for both
jobs, and you may well fall between 2 stools. And I say that as a typical
Java-head.

My only question about the 2-tool combination is whether there are better
combinations. Python has all the libraries and community support so any
alternative would need similar. Maybe Node?

As for the number crunching, I think Rust would be a better choice here. Good
memory management is its USP and that can have significant performance
benefits.

~~~
iamsohungry
> My only question about the 2-tool combination is whether there are better
> combinations. Python has all the libraries and community support so any
> alternative would need similar. Maybe Node?

Absolutely not. Python has a much more mature set of better-designed
libraries, better tooling, and a much more reasonable type system, and the
community has only recently begun to be polluted by the Web 2.0 "move fast and
break things" mentality. The Node ecosystem is _built_ on that mentality, and
while outlier developers at the front line of the JS community are able to be
extraordinarily productive in JS/Node, developers building user-facing
programs just get bogged down in breaking changes in dependencies, buggy and
poorly-designed 0.x libraries, untraceable framework code that prioritizes
configuration over convention, and intractable errors caused by a broken type
system (this final issue is significantly improved in ES6).

Source: Worked in Python 2.x for a few years, worked in Node for a few years,
currently work in Python 3.x and Node (general design is to limit Node to
compiling and testing the browser-based part of our product). JS/Node is maybe
20% of our code, 20% of our features, 50% of our dev time, and 80% of our
bugs.

~~~
kitd
Yes you're right, Node wouldn't be good.

Then I remembered Perl and realised that data pre-/post-processing is
precisely why it was invented in the first place.

Perl + Rust would be an interesting combo IMHO.

------
pen2l
This may be a liiiitle bit off-topic, but I really need to get it off my
chest: Python for high-performance scientific computing works _beautifully_...
it's a dream. Scipy/numpy, matplotlib, pandas, ipython. They're all
unbelievably awesome. It all just works.

 _Except_, when you're on Windows, where it just doesn't. Just installing
things and doing the 'hello world' for the aforementioned libraries is
laughably impossible.

So, use Python, but use it only on Linux.

(Okay, if you absolutely must do it in Windows: Use Anaconda).

~~~
dagw
FUD. I've been using Python on Windows for years, and between Anaconda and
Christoph Gohlke's Python packages I've yet to run into something that didn't
just work.

~~~
pen2l
Yeah with Anaconda it's not as bad. But no-one told me that's what I needed if
I wanted all the science packages to work on Windows... I'd never even heard
of Anaconda before this. It took all of my blood and tears for weeks before I
got everything fixed. So I guess what's wrong here is the lack of
documentation.

~~~
RogerL
NumPy/SciPy's install page tells you to use Anaconda (or equivalent) for
Windows, and build from repositories for Linux.

[http://www.scipy.org/install.html](http://www.scipy.org/install.html)

Granted, it does not say "and by the way, when you do that you get tons of
other great things like pandas, matplotlib, etc". But they are very clear
about not trying to do this yourself.

------
pbowyer
> once I had a decent algorithm, I could turn to Cython to tighten up the
> bottlenecks and make it fast.

What are your preferred ways to profile Python code? Coming recently from PHP,
where we have XDebug/KCachegrind, the excellent Facebook-sponsored Xhprof,
[https://blackfire.io](https://blackfire.io) and
[https://tideways.io](https://tideways.io), it's felt a step backwards.

I've tried line_profiler, and used memory_profiler and cProfile with
pyprof2calltree and KCachegrind. I've found the cProfile output confusing when
it crosses the Python-C barrier for numpy, sklearn etc.

~~~
lqdc13
cProfile takes a while to learn how to use well.

What didn't you like about line_profiler?

Here's a good guide on how to write fast(ish) code in Python:
[https://wiki.python.org/moin/PythonSpeed/PerformanceTips](https://wiki.python.org/moin/PythonSpeed/PerformanceTips)
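For reference, a minimal cProfile session looks like this (the profiled function is just a stand-in):

```python
import cProfile
import io
import pstats

def slow_square_sum(n):
    # stand-in for a real hotspot
    return sum(i * i for i in range(n))

pr = cProfile.Profile()
pr.enable()
slow_square_sum(200000)
pr.disable()

# Print the five most expensive entries by cumulative time.
buf = io.StringIO()
pstats.Stats(pr, stream=buf).sort_stats("cumulative").print_stats(5)
print(buf.getvalue())
```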

Generally, the best strategy for me has been to use NumPy wherever possible
and to avoid creating many complex objects. Best to use built in dicts or
tuples for things that store data. Thus the only time I run into issues is
when implementing algorithms in which case I usually isolate the slow function
and turn it into a Cython module. Recently have been playing around with
[https://github.com/jboy/nim-pymod](https://github.com/jboy/nim-pymod) which
seems like a much better solution.

~~~
pbowyer
> cProfile takes a while to learn how to use well.

I bet :) I have only scratched the surface, that's for sure.

> What didn't you like about line_profiler?

In this case I was trying to profile the overall simulation codebase to find
the slow spots (rather than guess) - a simulation that takes 60 minutes to
run. line_profiler wasn't great at giving digestible results from that - and I
haven't worked out how to write all output (across multiple modules) to file,
without specifying the file in each decorator.

I've started to break the codebase down into 'tests' to measure each algorithm
separately, and will give line_profiler another go then.

Re memory_profiler: while the mprof command showed all peaks, the line-by-line
output only showed the result after executing each line - so when 4GB of RAM
disappeared in a skimage call, only to be released at the end, it wasn't
reflected in the output. Which is tricky when trying to reduce overall memory
usage.

------
cballard
Why isn't Haskell, or any other functional language, popular for this sort of
thing? Turning A into B is what FP excels at, and you shouldn't have to reason
about side effects, besides writing the graph images somewhere.

From what I've heard from a friend of using other people's code in one
particular scientific field (stringly type some of the things, probably
accidentally, don't document this), an at-least-passable type system would be
a huge improvement.

~~~
Tomte
Because the Python ecosystem is huge, with real scientists writing real
libraries to get stuff done.

The Haskell crowd seems to write monad tutorials that are either cute or
unintelligible, and stratosphere-high level stuff where I wouldn't have the
slightest clue what I could use it for (Arrows? Zippers?).

~~~
j-pb
C'mon, zippers are not that hard, and they're really useful.

Let's say you want to process some XML file. Normally you'd walk the tree and
do manipulations in place. With zippers, however, you can inspect every
intermediate tree result; you can rethink your problem so that you walk the
tree once to extract interesting information, then compute a changeset for the
tree, maybe merge it with a differently computed changeset, and then apply the
union of them.

I built an OCR system in Clojure on zippers and it was a lot of fun. You could
for example extract a list of all the words, with line wraps removed, then do
the correction on that view/lens of your data, and reapply the changes without
having to worry about reintroducing the pesky line wraps, because they were
never removed from the original document.
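The idea carries over to any language; here's a toy list zipper in Python (illustrative only - Clojure's zippers are tree-shaped and more general):

```python
# A zipper is a focus position plus enough context to rebuild the
# whole structure: (items left of focus, focus-and-the-rest).
def from_list(xs):
    return ([], list(xs))

def right(z):                      # move the focus one step right
    left, rest = z
    return (left + [rest[0]], rest[1:])

def edit(z, f):                    # change the focused item
    left, rest = z
    return (left, [f(rest[0])] + rest[1:])

def to_list(z):                    # zip back up
    left, rest = z
    return left + rest

z = edit(right(from_list([1, 2, 3])), lambda x: x * 10)
assert to_list(z) == [1, 20, 3]
```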

------
kfk
I have in my hands a pretty interesting BI project for a big company. So far,
the proposal on the table has been .NET and SQL Server, but I am wondering if
I should at least try to give python a chance. Pandas is a great library, with
great people working on it. Django the same. On the other hand, .NET has lots
of professional (aka: with paid licenses) libraries that seem more fit for an
enterprise project. Looking from a company perspective, the drawback Python
has is, strangely, the lack of paid-for alternatives. It's not that people in
companies don't trust open source (Hadoop is becoming big here too), but one
wonders if the developers will be able to find the support they need in case
any issues arise from a free library.

~~~
cpbotha
I have first-hand experience with a BI-ish system, squarely targeted at the
enterprise and doing quite well there, that we wrote using Django and a whole
list of open source components.

We did run into some resistance initially, because our stack is almost the
opposite in every way of what our enterprise colleagues are used to. However,
our development velocity, especially around analytical features and just in
general, has made a most gratifying impact.

(I have to add that the 5-man dev team we have working on this is stellar.
It's hard to determine scientifically what the interaction is between team
quality and the choice of a Python-oriented software stack. See Paul Graham's
essays for more discussion on that point.)

In terms of support: There are many highly professional often boutique
software agencies that can support Django systems, if you're not around. To my
mind this is even better than the normal commercial support you get from a
different vendor for each different component in your commercial enterprise
system.

~~~
kfk
It would be interesting to hear about your experience, did you do a write up
somewhere or could I email you with few questions?

My project would be to put different data sources together + to allow users to
upload their own structured data via Excel (think financial estimates). The
current system has about 450 users, the next might have much more depending if
it gets extended to other divisions.

~~~
cpbotha
You could send an email to the work address on my personal website (see HN
profile). I can't promise that I'll be able to answer everything. :)

------
banku_brougham
A very beginner Java programmer here. It's a nicely organized notebook, great
demo, but: seems like a lot of effort was put into optimizing the python
efforts, and none for Java. Isn't that an unfair comparison?

My real question is: is it so much easier to do this exercise in Python than
Java, assuming equal proficiency in each?

~~~
netheril96
Java has no operator overloading. Many Java developers vehemently oppose
addition of operator overloading into the language, as if it were the root of
all evil. The lack of that feature results in convoluted function calls where
a clear and concise math expression would suffice. Consequently, not many
people choose Java for math tasks, and not many people write math libraries
for Java.

That is just my guess.
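To illustrate the difference (hypothetical vector class; the Java-style calls appear only in a comment):

```python
class Vec:
    def __init__(self, xs):
        self.xs = list(xs)
    def __add__(self, other):
        return Vec(a + b for a, b in zip(self.xs, other.xs))
    def __mul__(self, k):
        return Vec(a * k for a in self.xs)

a, b, c = Vec([1, 2]), Vec([3, 4]), Vec([5, 6])

# With overloading, the math reads as written;
# without it: a.add(b.multiply(2)).add(c)
r = a + b * 2 + c
assert r.xs == [12, 16]
```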

------
rdtsc
Agreed. Python is an excellent tool in that respect. Batteries included helps.
Being able to access fast C routines helps. Compile-to-native projects like
Numba and Cython also help. And of course, ipython (Jupyter) notebooks for
exploration.

------
cosmoharrigan
Jake Vanderplas previously wrote an excellent blog post about Python
performance and scientific computing:
[https://jakevdp.github.io/blog/2014/05/09/why-python-is-slow/](https://jakevdp.github.io/blog/2014/05/09/why-python-is-slow/)

------
daemonk
The interesting insight from the article is that python might be a good
language for learning algorithms. The fast development time allows you to
write a complete program (albeit clunky) without the pre-optimizing you might
be tempted to do in other languages.

------
kriro
My current setup for "scientific computing" is RStudio and CSV files for
quickly running typical stats tests (a couple of t-tests and a TOST +
Krippendorff's alpha the last couple of months) and Python + libraries for
anything that resembles "building stuff" (mostly scikit-learn to build some
classifiers). I mostly use R "as a consumer", i.e. I basically use RStudio
whenever my colleagues fire up SPSS. That combination works fairly well. I'd
recommend it to anyone entering academia in any field that involves statistics
who doesn't want to use the typical proprietary tools. (I've also tried PSPP
and it works ok for basic tasks but lacks a lot of functionality. If all you
want to do is run a quick t-test or ANOVA, it's a decent tool.)

------
hcrisp
Good article. Couldn't find who wrote it since it doesn't have a byline. I'm
guessing it was Leland McInnes?

~~~
a_bonobo
I think you're correct, the source .ipynb has only one contributor:

[https://github.com/lmcinnes/hdbscan/commits/master/notebooks...](https://github.com/lmcinnes/hdbscan/commits/master/notebooks/Python%20vs%20Java.ipynb)

------
SeanDav
Couple of questions:

- Could this be / was this developed in Python 3.x?

- What is this "notebook" he keeps referring to?

~~~
TallGuyShort
The "notebook" he refers to is the web page itself. It's a format that's
becoming popular among Python / data science folks: a web-based, often
interactive layout with code and results embedded in it. See "Jupyter
Notebook", the viewer for which is what you're viewing as you read this.

------
buildops
Absolutely and it is even easier if you use Ceemple for your IDE

------
bipin_nag
I use Spark. Will using Python help a lot ?

------
boulos
If Numpy, Pandas, etc. were wrappable from JavaScript this could have easily
been titled "Why I use Node.js for High Performance Scientific Computing".

The "Python" here isn't particularly material to the result, it's mostly a
wrapper around C. Toss in Cython, and now you've really gone outside the
bounds of "I'm just using 'Python' for HPC!".

I agree some of the tooling and niceties are beyond a doubt best in breed with
Python, but it's disingenuous to equate this to "writing HPC code in Python".
If you had written an RPython-to-Verilog translator that produced an FPGA
implementation of your algorithm, would you call that "using Python"?

~~~
wesm
Completely bizarre attitude (creator of pandas here).

~~~
sukulaku
Why is it bizarre?

His point is that the article shouldn't make it sound like "Python is fast",
because the speed actually comes from the libraries that have been implemented
in C.

~~~
wesm
By bizarre I mean impractical and unhelpful. What's the point of programming
at all if we cannot leverage abstractions to make ourselves more productive?

I believe what the article says is that "Python has tools that enable a savvy
user to achieve better results with less effort". Python is extremely popular
in HPC settings (including supercomputers) for this reason. I see nothing
disingenuous.

~~~
sukulaku
Well, the word "bizarre" has a commonly understood meaning, but you somehow
decided to use it to mean something completely different. _That's_ a bit
bizarre :P

But the article is titled "Why I Still Use _Python_ for High Performance
Scientific Computing", and it gives the impression that _Python_ - the
_language_ - is fast enough for HPSC.

In reality, the reason why he _"still"_ uses Python is that _the libraries_
are fast enough for HPSC. But that's not what people see when reading posts
like this.

The message they see is that "Python is fast", not that _some_ of its
3rd-party libraries are fast.

But I should probably stop repeating myself here.

~~~
andreasvc
First, there's no reason to only associate "Python" with "the language", it is
an environment, ecosystem, etc. It's not interesting to narrowly focus on the
efficiency of the interpreter.

Second, it _is_ an inherent feature of the design of CPython that its C API
allows tight integration with external libraries in C. Cython does not just
glue C and Python together, it does this in a way which makes the integration
easier and safer than doing it by hand (e.g., generating correct reference
counting code). All of this is a direct benefit of Python and its scientific
ecosystem.

~~~
sukulaku
> First, there's no reason to only associate "Python" with "the language", it
> is an environment, ecosystem, etc. It's not interesting to narrowly focus on
> the efficiency of the interpreter.

Sure, but that's how people construe the post, which I think the author knows
too.

The post could have been accurately titled _"Python has certain libraries
that are fast enough for HPSC"_, but that wouldn't have generated nearly as
much interest as _"Why I still use Python for HPSC"_.

People want to see others say good things about their favourite _language_ ,
which in this case is Python.

I knew where the article was going (with the libraries), but still wanted to
read it because even I wanted to see someone compliment Python itself, because
it used to be my "primary language".

Python is a good language, but to the extent you're _not_ using those highly
optimized "C-libraries", it just doesn't perform well, and it's not suited for
concurrency. That's a part of why I switched to Clojure.

~~~
ajuc
Obviously Clojure, being written in Java (at least the interesting parts -
collections and multithreading), isn't fast (if Python isn't fast).

~~~
sukulaku
Right, but I wouldn't claim _it_ is. I'd say something like "The _JVM_ is
fast" :)

