

SciPy - the embarrassing way to code - edw519
http://www.vetta.org/2008/05/scipy-the-embarrassing-way-to-code/?

======
chromoose
After spending most of last night trying to model and fit a bunch of data
with scipy, my main bottleneck was my own understanding: working out what I
needed to do. Once that was done, everything could be implemented in a single
call. I don't think that's a particularly bad bottleneck at all.

There was some good discussion about this last time this was posted, and also
current discussion over on reddit.

<http://news.ycombinator.com/item?id=182676>
[http://www.reddit.com/r/Python/comments/a2xzj/the_embarrassi...](http://www.reddit.com/r/Python/comments/a2xzj/the_embarrassing_way_to_code/)

~~~
dkarl
The problem is that programmers like to program. Make the programming part
trivial, and we start to fear that we aren't needed, and that perhaps we could
be replaced with domain experts :-O

------
hyperbovine
I had the exact same experience recently while writing a dynamic programming
simulation in Python. The naive (and elegant) solution to the problem is a
20-line recursive function, which took me about an hour to write.
Unfortunately, function calls are very expensive in Python, and my sim
required a couple billion of them, so I started searching for ways to
eliminate the recursion. Three weeks and a deep foray into NumPy later, I had
it.

The second implementation is about the same length and replaces all of the
simplicity of the recursive solution with a bunch of five-dimensional arrays.
Never before did I have to think as deeply about vectorized functions, array
broadcasting, dimension matching, etc. The non-recursive version is about an
order of magnitude faster, and what's better, using the numpy operators means
nearly all of the processing is offloaded into low-level C routines, so the
Python overhead is effectively nil. On the downside the code is extremely
difficult to follow now... trying to visualize what's going on in 5 dimensions
is a real pain.
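A toy sketch of this kind of rewrite (not the poster's actual simulation, and in two dimensions rather than five): a recurrence written naively as a recursive function, and the same recurrence collapsed so that one NumPy call replaces an entire level of recursive calls.

```python
import numpy as np
from functools import lru_cache

# Recursive formulation: number of monotone lattice paths to (i, j).
# Elegant, but every cell costs a Python function call.
@lru_cache(maxsize=None)
def paths(i, j):
    if i == 0 or j == 0:
        return 1
    return paths(i - 1, j) + paths(i, j - 1)

# Vectorized formulation: each row of the DP table is the cumulative
# sum of the previous row, so one np.cumsum replaces a whole level of
# recursive calls and runs in low-level C.
def paths_vec(n, m):
    row = np.ones(m + 1, dtype=np.int64)
    for _ in range(n):
        row = np.cumsum(row)
    return int(row[-1])
```

The two agree (e.g. `paths(10, 10) == paths_vec(10, 10)`), but as the comment says, the vectorized version trades the readable recurrence for reasoning about array shapes.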

~~~
DannoHung
Might be able to save yourself some trouble if you can generalize it to an
n-dimensional set of operations. Then again, that might not work for your
situation.

------
rabidgnat
I did some SciPy work in computer vision applications, and it was both a
blessing and a curse. About 80% of the code could be reduced into matrix
operations, which provided the 'embarrassing' moments that the article
mentions. However, I'd hit something like Hysteresis Thresholding in the Canny
Edge Detector, which traces through the image looking for faint edges, and
spend a day figuring out how to write a Python extension against the Numpy
API. This was necessary because any code I wrote that iterated over 1000 x
1000 matrices in pure Python took well over 10 seconds, instead of the < 100
milliseconds it should have taken.

As a nublet, I probably didn't end up saving much time over C++, but I could
see that in the right hands this tool would be a lifesaver.
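A toy sketch of the "80% reduces to matrix operations" point (this is not the Canny detector itself, and the function names are made up): the two thresholds of hysteresis thresholding classify every pixel independently, so they vectorize; it is the sequential edge-tracing afterwards that resists this style and pushed the poster toward a C extension.

```python
import numpy as np

def classify_loop(img, lo, hi):
    # Pure-Python iteration over every pixel: this is the style that
    # takes seconds on a 1000 x 1000 image.
    out = np.zeros(img.shape, dtype=np.uint8)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            if img[i, j] >= hi:
                out[i, j] = 2          # strong edge
            elif img[i, j] >= lo:
                out[i, j] = 1          # weak edge
    return out

def classify_vec(img, lo, hi):
    # Same classification as two whole-array comparisons: img >= hi
    # implies img >= lo, so strong pixels sum to 2, weak ones to 1.
    return (img >= lo).astype(np.uint8) + (img >= hi).astype(np.uint8)

img = np.random.default_rng(0).random((200, 200))
```

Both produce identical label arrays; only the vectorized one stays out of the Python interpreter's inner loop.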

~~~
kurtosis
scipy.weave solved a lot of these types of problems for me. It lets you place
inline C++ (with access to the Blitz library), which does the trick for those
cases where expressing something as a matrix operation would be wasteful.

------
ramanujan
scipy still doesn't have a good submodule for doing manipulations with
multivariate gaussians, which is a HUGE part of machine learning applications.

I'm referring to the analog of R's mvtnorm library.

Yes, you can use rpy2. Yes, you can roll your own dmvnorm (but be careful
about the degenerate situations with zero eigenvalues that always arise in
high dimensional problems with real data). It is significantly more of a pain
to roll your own pmvnorm, because you are now talking about efficient
computation of high dimensional integrals, which always seems to involve a
trip to Numerical Recipes or the like (see the Genz and Miwa references when
you do "?pmvnorm" in R).

Basically this is a huge thing to lack.

PS: yes, if you grep the codebase, you can find this:
[http://www.scipy.org/doc/api_docs/SciPy.stats.kde.gaussian_k...](http://www.scipy.org/doc/api_docs/SciPy.stats.kde.gaussian_kde.html)

But you need to write wrappers because it doesn't provide a basic interface to
the multivariate gaussian density.
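For the dmvnorm half, rolling your own might look like the sketch below (an illustration, not a SciPy API; the name `logdmvnorm` and the `eps` cutoff are made up here). Working in the eigenbasis of the covariance lets you drop the near-zero eigenvalues the parent warns about, instead of inverting a singular matrix:

```python
import numpy as np

def logdmvnorm(x, mean, cov, eps=1e-10):
    # Log-density of a multivariate gaussian, robust to degenerate
    # covariances: eigen-decompose, discard near-zero modes, and
    # compute the Mahalanobis distance in the surviving eigenbasis.
    w, v = np.linalg.eigh(cov)       # cov is symmetric, so eigh
    keep = w > eps                   # drop degenerate directions
    w, v = w[keep], v[:, keep]
    dev = v.T @ (x - mean)           # deviation in the eigenbasis
    maha = np.sum(dev ** 2 / w)      # Mahalanobis distance
    k = keep.sum()                   # effective dimensionality
    return -0.5 * (k * np.log(2 * np.pi) + np.sum(np.log(w)) + maha)
```

pmvnorm is the genuinely hard half, as the parent says: that is high-dimensional integration, which is why R defers to the Genz algorithms rather than anything this short.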

I could go on about other parts of scipy -- it has some advanced stuff but
lacks a lot of basics...

~~~
SwellJoe
The definition of "basics" depends on your domain. For the fluid dynamics and
geological data that Enthought (the sponsor and primary developer of SciPy)
works with, SciPy and NumPy provide all the basics.

One could argue that if you're going to call something "SciPy", it ought to
cover all sorts of scientific computation...but, it is Open Source under a
very open license. Contributing is easy, the team is friendly to outsiders
sending patches, and it's not hard to stop being considered an "outsider". So,
if you need something and SciPy doesn't provide it, why not develop it and
contribute?

What I'm trying to say is that complaining about missing functions you need
from an Open Source project just wastes your time and annoys the pig.

~~~
ramanujan
Hey, without going into details I have contributed a fair amount of code to R.

Scipy (and the whole Python numerical computation stack) has nontrivial
competition out there and here's what it needs to do to compete...if its
developers care about adoption, which most do.

Definitely not saying scipy == crap, just that for stats/machine learning
(which is a big percentage of a lot of applications today) it is not mature.

------
fhars
"It's just a matrix" and all the comments sound like SciPy is some kind of APL
for the rest of us, is that impression correct? There you too can achieve
incredible conciseness once you manage to formulate you problem using some
matrices and vectors with funny dimension. (And you get the benefit that the
resulting code doesn't look too simple, so you get something to show off witth
in return for your effort.)

~~~
gjm11
Depends on what you think of when you read "APL".

1. A language with matrices as a central data structure and a rich variety of
operations on them. Yup, SciPy is this, kinda, though not to the same extent
as APL.

2. A language with an extremely strange syntax, apparently engineered for
terseness on the smallest possible scale. Nope, SciPy is basically just a
bunch of Python libraries, and the language syntax is still Python's.

(There's more to SciPy than the matrix operations, but that's what the
original poster was mostly talking about and what might prompt comparisons
with APL.)

------
ynniv
This is precisely why people are afraid of math - the more work you do, the
smaller the end result. E = mc^2 is actually an approximation, but even the
full set isn't very complicated, and imagine how difficult it is to get there.
The other truth is that less code is generally faster, but not as small or as
fast as a better algorithm, which is in turn not as small or as fast as a
mathematical equation. The poster talks about moving across this space, which
is actually a great accomplishment... but will probably get the best reception
among great programmers or mathematicians.

~~~
ars
E = mc^2 is not an approximation. What are you referring to?

~~~
danteembermage
They might have been referring to relativistic effects. Wikipedia has E^2 -
(pc)^2 = (mc^2)^2

<http://en.wikipedia.org/wiki/Mass_in_special_relativity>

------
mhartl
I once spent around ten weeks writing a C program to numerically integrate the
Schrödinger equation for an electron in a particular magnetic field. When I
started looking at results, they looked eerily like trig functions. A little
digging revealed that the problem was stated and solved (exactly) in one of
the exercises to _Quantum Mechanics_ by Landau and Lifshitz.

I sheepishly told my advisor at the time what had happened, worried he would
think me an idiot, but he was thrilled: "You solved the problem! That's what's
important."

------
buugs
I think the author is taking the wrong thing away from this: all those little
realisations mean that if he comes across similar problems in the future, he
will have a better understanding of what to do.

In other words, rather than being embarrassed that a month's work boiled down
to just a few lines, think about how much you learned to get to those few
lines.

~~~
markerdmann
His use of the word "embarrassing" is tongue-in-cheek.

------
Tichy
Just curious, can SciPy also automatically take advantage of cell processors
or graphics processors? If it is all just matrix operations, it seems doable?

~~~
cdavid
It cannot, at least not at the moment and not out of the box.

I am by no means knowledgeable in those areas, but my understanding is that
using specialized hardware or aggressively optimized code in numpy (or
similar tools) would be very hard. One of the problems is that a CUDA- or
multi-core-optimized numpy would have a significant overhead (obviously of a
totally different nature), and would be justified only for very large
problems. Multi-core numpy was tried a few years ago by E. Jones, and the
overhead was almost never worth it, IIRC.

The current solution is to use those architectures through libraries which
implement linear algebra, etc., and which can be built on top of them. There
is obviously quite a bit of excitement around those technologies, and this
was one of the most talked-about areas at the last SciPy conference. Videos
are available: <http://blog.enthought.com/?p=184>. Some of them cover those
technologies.

