
Make Lisp 15x faster than Python or 4x faster than Java - smanek
http://blog.postabon.com/make-lisp-15x-faster-than-python-or-4x-faster
======
Erwin
I ran the Python script through Shedskin, a partial Python implementation that
compiles Python to machine code via an intermediate layer of C++. I did not
have to declare any types for any variables, or add declarations for
speed/safety; Shedskin automatically determined the types.

The program's runtime went down from 533.7 seconds on my machine to 24.42
seconds. If we adjust for machine speed (since the author's machine ran the
Python code in 474s), the comparative optimized result would be 21.7 seconds,
compared to the author's hand-optimized Lisp code at 30.2.

Note that this is probably an ideal case for Shedskin (which is a relatively
new product: a 1-man job that's just a few years old, compared to, I'd guess,
hundreds if not thousands of man-years of research that have gone into Lisp
implementations).

Still, an interesting result.

~~~
pbiggar
There may be hundreds of man-years of research gone into LISP, but a lot of
that research found its way into Shedskin too. And the Self research is in
there too. And his own research (Shedskin was originally his Master's thesis).

~~~
srepmub
the main techniques that I use for type inference were invented by ole agesen
and john plevyak, who worked on self and concurrent aggregates, respectively
(iirc). it may be they were influenced by research on lisp, but I doubt it.

------
pvg
It's better to use the actual JDK (i.e. the one everyone uses) rather than the
implementation you get from Debian's apt-get, due to Debian's licensing
hangups.

SBCL:

    
    
      * (test)
    
      Evaluation took:
        45.268 seconds of real time
    

Sun JDK

    
    
      Runtime was: 51086 Milliseconds
    
     

That's less than 15% - a far, far cry from '4 times'.

------
mikemike
The result of the distance() function is unused and can be optimized away by
the compiler. Runs in 0.04s with LuaJIT 2.0. No type declarations needed. :-)

    
    
        local radius = 6371
    
        local function distance(latA, lngA, latB, lngB)
          local latAr = math.rad(latA)
          local lngAr = math.rad(lngA)
          local latBr = math.rad(latB)
          local lngBr = math.rad(lngB)
    
          local deltaLat = latBr - latAr
          local deltaLng = lngBr - lngAr
    
          return radius * 2 * math.asin(math.sqrt(
                  math.sin(deltaLat/2)^2 + math.cos(latAr)
                  * math.cos(latBr) * (math.sin(deltaLng/2)^2)))
        end
    
        local start = os.clock()
    
        for latA=-90,90,2.5 do
          for lngA=-90,90,2.5 do
            for latB=-90,90,2.5 do
              for lngB=-90,90,2.5 do
                distance(latA, lngA, latB, lngB)
              end
            end
          end
        end
    
        local stop = os.clock()
        print(stop-start)
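For comparison, a common way to defeat exactly this dead-code elimination is to fold every result into an accumulator and print it at the end. A minimal Python sketch of the pattern (the coarse 15-degree grid is just to keep it quick; CPython itself won't drop the calls, but the same benchmark under LuaJIT or a JIT-ing JVM would need this):

```python
import math
import time

# haversine distance, as in the article's benchmark
def distance(lat_a, lng_a, lat_b, lng_b):
    la, lb = math.radians(lat_a), math.radians(lat_b)
    dlng = math.radians(lng_b - lng_a)
    h = (math.sin((lb - la) / 2) ** 2
         + math.cos(la) * math.cos(lb) * math.sin(dlng / 2) ** 2)
    return 6371 * 2 * math.asin(math.sqrt(h))

total = 0.0
start = time.perf_counter()
for lat in range(-90, 91, 15):
    for lng in range(-90, 91, 15):
        # using every result stops an optimizer from dropping the call
        total += distance(lat, lng, 0.0, 0.0)
elapsed = time.perf_counter() - start
print(elapsed, total)
```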

~~~
jbooth
Ha.. of course, the real problem is that Java and Python are actually doing
all of the work to compute their trigonometric functions, while C libraries
that Lisp uses probably have them mapped into a table.

Java does actually keep a table for trig functions on small values -- they
just haven't widened the table definition, which is filed as a bug at
<http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=5005861>

"I just performed some simple benchmarks of my own, and noticed that summing
e.g. sqrt, sin and tan for small values are just as fast as their C
equivalents. This was a great relief to me.

However, when computing sin in the range of pi/2 ... pi and 10 ... 10+pi/6
Java became 3-5 times slower, while the C program was just as fast. This is
explained above by lack of precision, but for example sin(10.0) yields bit-to-
bit exactly the same result in Java and C! This comparison was performed on a
rather old 1GHz Mobile Intel Pentium, so I would assume that the precision has
been fixed in virtually all modern processors, and therefore these operations
could be intrinsified for a much larger value range (possibly the entire
double range)."

~~~
Estragon

      Java and Python are actually doing all of the work to compute their 
      trigonometric functions
    

Not true in the case of CPython. It uses the underlying C implementation. See
line 270 onwards.

[http://www.google.com/codesearch/p?hl=en#2T6lfGELm_A/trunk/M...](http://www.google.com/codesearch/p?hl=en#2T6lfGELm_A/trunk/Modules/mathmodule.c)

~~~
jbooth
Interesting.. well, I'd think the trig functions would be by far the most
expensive part of that program. Is Python really that much slower for control
structures? That's literally the only other computation there.

~~~
pbiggar
Well, every instruction in a Python program requires running through the
interpreter loop. Then you have to decode the instruction, get the opcode
arguments, then actually do the operation. For calling C functions from
Python, you also have to extract the C types, call the C function, and wrap
the result back up. It's not hugely slow, but yeah, it adds up, and it's
fairly easy to see why it compares poorly to compiled versions.
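You can see that dispatch cost directly with CPython's `dis` module; every opcode below is one trip through the interpreter loop before any trigonometry happens (opcode names vary a bit between CPython versions):

```python
import dis
import math

def half_sine_sq(x):
    return math.sin(x / 2) ** 2

# each entry here is one pass through the interpreter's dispatch loop
ops = [ins.opname for ins in dis.get_instructions(half_sine_sq)]
print(ops)
```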

------
revorad
_The most common way to compute your distance from a deal is to assume the
earth is a perfect sphere_

I thought you would only really be finding deals at distances small enough to
just assume the earth is flat. Or at least, aren't those the only ones people
would actually be interested in? If it's beyond my physical reach, I don't
care how far it is. I'll get it shipped.

Or am I missing something?

~~~
smanek
You're absolutely right - in my experience, the Pythagorean theorem differs
from the haversine formula by less than ~50 meters for distances under 20
miles or so. In the actual code I use a simple heuristic to decide when to use
which calculation.

Like I mentioned in the post, the distance function presented is a bit of a
simplification I invented for pedagogical purposes (my goal was to examine
languages, not geometry).
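For anyone curious about the size of that error, here is a hedged sketch (not the Postabon code; the test point is made up) comparing the haversine formula against a flat-earth equirectangular approximation:

```python
import math

R = 6371.0  # mean earth radius in km

def haversine(lat_a, lng_a, lat_b, lng_b):
    la, lb = math.radians(lat_a), math.radians(lat_b)
    dlng = math.radians(lng_b - lng_a)
    h = (math.sin((lb - la) / 2) ** 2
         + math.cos(la) * math.cos(lb) * math.sin(dlng / 2) ** 2)
    return 2 * R * math.asin(math.sqrt(h))

def flat_earth(lat_a, lng_a, lat_b, lng_b):
    # equirectangular approximation: treat the local patch as a plane
    la, lb = math.radians(lat_a), math.radians(lat_b)
    x = math.radians(lng_b - lng_a) * math.cos((la + lb) / 2)
    y = lb - la
    return R * math.hypot(x, y)

d1 = haversine(40.0, -74.0, 40.1, -74.1)   # two points ~14 km apart
d2 = flat_earth(40.0, -74.0, 40.1, -74.1)
print(d1, d2, abs(d1 - d2) * 1000, "m difference")
```

At that scale the two agree to well under 50 meters, which matches the heuristic above.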

~~~
jacquesm
Can you release the source data used for the benchmark?

I'd like to compare it to C.

~~~
smanek
There is no source data.

I just iterate through each latitude from -90 to 90 and each longitude from
-180 to 180 in increments of 2.5 (for both the start and end). Check out the
source code for any of the 5 examples (included in the post).

If you'd like, post the code up somewhere and I'll be happy to run it on the
same machine and post the results along with the rest.

~~~
catch23
I also wonder why your Java benchmark is slow. Java seems to be around 2-3x
slower than C on average; this would make your optimized CL around 1.5x faster
than C.

~~~
mahmud
Define "slower". Languages don't have a fixed "speed" relation between them.
Performance varies between machine, language, problem, and implementation
technique.

~~~
igouy
Is any of that so remarkable that we should presume catch23 doesn't know it?

A charitable response would help catch23 understand what's different about
this one problem from what we can sensibly presume as common experience - in
this case, the JVM's trig functions, not Sun Java generally.

------
brlewis
Are people still calling Lisp slow in 2010? That was somewhat true when most
programming was done in C. Plenty of languages slower than Lisp have risen to
popularity since then.

~~~
drats
Indeed, I thought it was always much faster:

[http://shootout.alioth.debian.org/u32/benchmark.php?test=all...](http://shootout.alioth.debian.org/u32/benchmark.php?test=all&lang=python3&lang2=sbcl)

One of the guys behind R is also looking at moving to a Lisp base after
considering Python:

<http://www.springerlink.com/content/v38u176xp7j562m3/>

That said, I will stick with Python, as I am not a programmer by profession
and can't deal with a plethora of languages when I have a good Swiss army
knife as it is.

~~~
Semiapies
There are worse ones to pick.

------
maweaver
One interesting source of data is www.spoj.pl, since they keep records of the
runtime of submissions to their site.

For example, take a look at the solutions to PRIME1; the task is to generate a
list of prime numbers: <http://www.spoj.pl/ranks/PRIME1/>

The top 20 C solutions all take less than 0.05 seconds. The top 20 Java
solutions all take less than 0.5 seconds. Python has several solutions that
took 0.55 seconds. But for Lisp, only one user was able to get it under 1
second, and then only barely.

There's not enough data to draw any real conclusions, but it does make me
skeptical of claims that Lisp is near in speed to Java or C (any Lisp gurus
who want to prove me wrong, feel free to code up a fast solution and submit
it).
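For reference, the standard PRIME1 approach in any language is a segmented sieve; here is a hedged Python sketch (not a submitted solution, and untuned for SPOJ's limits): sieve the small primes up to sqrt(high), then mark their multiples inside the requested window.

```python
import math

def primes_in_range(low, high):
    # sieve the base primes up to sqrt(high)
    limit = math.isqrt(high) + 1
    base = [True] * (limit + 1)
    base[0:2] = [False, False]
    for i in range(2, math.isqrt(limit) + 1):
        if base[i]:
            base[i * i :: i] = [False] * len(base[i * i :: i])
    small = [i for i, is_p in enumerate(base) if is_p]

    # mark composites inside the [low, high] window
    window = [True] * (high - low + 1)
    for p in small:
        # max(p*p, ...) keeps p itself alive when it falls in the window
        start = max(p * p, ((low + p - 1) // p) * p)
        for m in range(start, high + 1, p):
            window[m - low] = False
    if low == 1:
        window[0] = False
    return [low + i for i, ok in enumerate(window) if ok]

print(primes_in_range(1, 30))
```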

~~~
stefano
Actually, there is a solution that takes 0.52 seconds:
<http://www.spoj.pl/ranks/PRIME1/lang=LISP%20sbcl>

------
agazso
I like the Lisp approach of optimizing the program by adding hints to the
runtime about types, and the ability to tune the optimizer's aggressiveness
from the code. It really makes it possible to write the correct algorithms
first and only later add the appropriate optimizations, as the difference
between the article's optimized and unoptimized functions shows.

That said I think these kinds of benchmarks (and article titles) are
misleading, because one who has deep knowledge in one language does not
necessarily write good code in other languages. Even the article notes that
"they are all just literal line-by-line translations of each other".

------
fauigerzigerk
Thanks, that's interesting. My own main optimization problem is always
reducing the memory usage of large in-memory data sets. I found that languages
like Java and Python, which don't support structured value types and use
references/pointers everywhere, are basically unusable for my purposes.

Now, I wonder whether I could use Lisp for what I do. One basic necessity
would be to have a way to define a structure like this

    
    
      struct s {
        int32_t x;
        int32_t y;  
      };
    

and then create an array of that type as struct s a[size], which uses exactly
size * sizeof(s) bytes in memory. A further requirement would be to have a
library of data structures (balanced trees, lists, hash tables) that support
storing such structs directly without using pointers everywhere.

Do you have any idea whether that could work with SBCL?

~~~
smanek
I think you can do that - although I'm not entirely certain how much overhead
a struct (defstruct) has. Array allocation has no overhead, to the best of my
knowledge (if you specify the element-type, etc).

I usually don't code at that low level, but you might want to check out
<http://www.lrde.epita.fr/~didier/research/verna.06.imecs.pdf>

------
pierrefar
In all seriousness, how does one start with setting up and learning how to
program a CL web app?

~~~
smanek
I'm glad you asked ;-)
<http://blog.postabon.com/a-simple-lisp-webapp-for-beginners>

~~~
pierrefar
Great stuff thanks!

$Weekend--;

~~~
mahmud
What platform are you on?

Let me know if you're on Win32 and I will package for you my setup, with a
double-clickable installer, and your choice of Emacs or Win32 friendly IDE :-)

~~~
iheartmemcache
Even though I do 90% of my dev work in *nix, I'd be very interested in the
emacs version of this (if only to satiate my geek hunger).

------
algorias
The Python benchmark is contaminated. Lots of <= checks, incrementing
variables, etc. Those things need to be left out of the timing. How about
using itertools.product instead?

    
    
      for latA, lngA, latB, lngB in itertools.product(*[xrange(-90,91), xrange(-180,181)]*2):
          distance(latA, lngA, latB, lngB)
    

Not sure it's faster, but certainly more idiomatic.

Also, the 2nd Python example doesn't work; you need to drop all the 'math.'
prefixes and use radians, sqrt, etc. directly.

------
neeleshs
On my laptop - Ubuntu 9.04/Python 2.6.2

Python without Psyco : 522

Python with Psyco : 190

------
saurabh
How about benchmarking the python code with NumPy?

~~~
smanek
My understanding is that you have to rewrite your code to take advantage of
NumPy. Unfortunately, I've never used NumPy before, so I don't know how to do
so.

However, if you'd like to modify my code to do so I'd be happy to rerun the
test and publish the results.

------
pvg
Isn't it a much better idea to pick a distance comparison algorithm that's not
so staggeringly slow to begin with? Why not convert (i.e. pre-compute) the
lat/lngs to 3-vectors (origin at the center of the earth) and then compare
distances by Pythagoras - no fuss, no muss, no trig - and you can even skip
the square root. If you need to display an actual, accurate distance, you can
do the (significantly smaller amount of) trig to get something presentable.
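A hedged sketch of that scheme (the names and coordinates are made up): precompute a unit 3-vector per known point, rank candidates by squared chord distance with no trig or sqrt in the loop, and convert only the winner back to kilometres.

```python
import math

def to_unit_vector(lat, lng):
    la, lo = math.radians(lat), math.radians(lng)
    return (math.cos(la) * math.cos(lo),
            math.cos(la) * math.sin(lo),
            math.sin(la))

def chord_sq(a, b):
    # squared straight-line (chord) distance between two unit vectors;
    # monotone in great-circle distance, so it sorts identically
    return sum((x - y) ** 2 for x, y in zip(a, b))

def chord_sq_to_km(c_sq, radius=6371.0):
    # invert the chord back to an arc length only when displaying
    return 2 * radius * math.asin(math.sqrt(c_sq) / 2)

deals = {name: to_unit_vector(lat, lng)
         for name, (lat, lng) in {"nyc": (40.7, -74.0),
                                  "sf": (37.8, -122.4),
                                  "london": (51.5, -0.1)}.items()}
user = to_unit_vector(42.4, -71.1)  # roughly Boston
nearest = min(deals, key=lambda name: chord_sq(user, deals[name]))
print(nearest, chord_sq_to_km(chord_sq(user, deals[nearest])))
```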

~~~
regularfry
It's valid if you want to understand what'll happen when you start getting
random (lat,long) pairs thrown at you that you can't precompute.

~~~
pvg
His problem is 'i have a bunch of existing points, find and compare distances
to new point'. Doing this by lat,lon and trig is known as one of the slowest
ways to do it.

~~~
regularfry
Um... no. From the article:

"I thought I'd demonstrate this functionality, with a bit of real world code.
Postabon recommends deals to users based on a variety of factors such as the
age of the deal, other people's votes, your preferences, and (relevant for
this example) your distance from the deal. Most of the other factors can be
computed asyncronously and just cached, but your distance from deals is
computed a lot, and can't really be pre-computed since I have no way of
knowing your location a priori (well, that's not strictly true, and we do do
some memoization, but it has a relatively low hit rate)."

~~~
pvg
Yes. He has a bunch of existing points (the deals) and a new point (the user's
location). Deals are the known points, which can be represented any way you
like: lat/lng, vectors, cylindrical coords, you name it. The thing not known
in advance is the user location. And don't 'umm...' me. It's rude and stupid,
especially when you don't seem to know what you're talking about.

~~~
regularfry
Yup, you're right. I had my concepts in a twist.

------
jbooth
Any guesses as to why the Java implementation is relatively slow? JNI boundary
crossings for floating point arithmetic, or lack of use of the math
coprocessor?

~~~
smanek
I don't know, but it's much slower than I expected. It's the latest Java
installable by apt on Debian (all these tests are run on Debian running under
VMware).

    
    
      $ java -version
      java version "1.6.0_0"
      OpenJDK  Runtime Environment (build 1.6.0_0-b11)
      OpenJDK 64-Bit Server VM (build 1.6.0_0-b11, mixed mode)

~~~
jbooth
Dohhhhh don't use OpenJDK :)

Get the sun version, it's probably faster.

Although it may not be much faster, found this unfixed bug report:
<http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=5005861>

~~~
jbooth
In more depth: the Sun JDK probably does a much better job of JIT compilation;
it's the difference between running interpreted bytecode and compiled native
code.

On the Hadoop project, checksumming was a major CPU constraint, because you
have to assume every file everywhere is possibly on faulty hardware, and
checksum everything. They found that it was actually faster to implement CRC32
in pure Java and let the JIT compilation do its magic than it was to make
native calls to zlib.

But that may not apply to floating point if the Java implementation is bad
enough.

------
jrockway
And of course, Haskell can be faster than all of these.

~~~
alec
Not always:

[http://shootout.alioth.debian.org/u32/benchmark.php?test=all...](http://shootout.alioth.debian.org/u32/benchmark.php?test=all&lang=ghc&lang2=sbcl)

~~~
jrockway
Of course. There are some benchmarks where Perl is faster than Java 6.

~~~
igouy
Why haven't you linked to them?

~~~
jrockway
I want to discuss the nature of benchmarks, not which line of Java to change
to make one program run one second faster.

------
felideon
Another, probably more detailed article on optimizing Common Lisp code in
SBCL:

[http://t-b-o-g.blogspot.com/2009/12/brians-brain-on-common-l...](http://t-b-o-g.blogspot.com/2009/12/brians-brain-on-common-lisp-take-3.html)

It highlights the importance of profiling code first.

------
jedbrown
1\. Looping with a floating point increment is silly here because you can't be
sure how many iterations it takes.

2\. C translation takes 11 seconds (compare to 25.6 for SBCL)

<http://gist.github.com/288936#file_silly> distance loop
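Point 1 is easy to demonstrate; here is a hedged Python sketch (the article's 2.5 step happens to be exactly representable in binary floating point, so its loops agree across languages, but the pattern is fragile in general):

```python
# With a floating-point loop counter the iteration count depends on
# rounding: 0.1 is not exactly representable, so this loop runs 3 times
# instead of the 4 you'd expect (0.0, 0.1, 0.2, 0.3).
count = 0
x = 0.0
while x <= 0.3:
    count += 1
    x += 0.1
print(count)

# Iterating an integer index and deriving the float value from it gives
# a guaranteed iteration count.
steps = 3  # we want x = 0.0, 0.1, 0.2, 0.3
count_int = 0
for i in range(steps + 1):
    x = i * 0.1  # derive x instead of accumulating it
    count_int += 1
print(count_int)
```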

~~~
pvg
Changing the loop indexes to integers is sillier because it makes the program
not only different from the original one but also slower.

~~~
jedbrown
Incorrect code is silly. It is not unlikely that the different languages are
performing a different number of iterations in these tests; correcting that by
being precise about the range to be iterated over is not silly. The integer
indices here produce the behavior that seems to be intended by the `<=`
comparison in the originals.

The speed difference is negligible since all but the inner multiplication is
hoisted out, and each inner loop takes over 250 cycles.

~~~
pvg
The speed difference is hardly negligible especially when talking about
benchmarks. It's actually about 15% on my OS X laptop, less but still
measurable on a Linux server. Your change actually has a greater impact on
execution time than off-by-ones in the iteration counts which don't really
take place.

~~~
jedbrown
15% is a bit higher than I benchmark, but here is a version that is guaranteed
to do the correct number of iterations, increments in floating point instead
of performing the slow cast, and is _faster_ than performing the comparison in
floating point math (because it occupies a different execution unit). Note
that it is also less accurate than the cast for most definitions of what this
function might be attempting to measure.

<http://gist.github.com/288980>

------
ez77
Is Java known to be (about) 4x faster than Python?

~~~
lliiffee
4x is low:

[http://shootout.alioth.debian.org/u32q/benchmark.php?test=al...](http://shootout.alioth.debian.org/u32q/benchmark.php?test=all&lang=python&lang2=java)

Roughly speaking, anything time-critical needs to be done in C.

~~~
ez77
Thanks! You guys rock! Also, would you therefore guess most of Google's search
engine is written in C?

~~~
lliiffee
I really don't know, but I would guess Google's search engine is almost all
C/C++. When you are operating at that scale, it makes sense to invest huge
programmer time for very small speed increases.

As for "normal" people, I do a lot of numerical computing using Python+C. I
end up with probably 90% of the code in Python, and 10% inner loops in C.
That's typically enough that the C code is taking, say, 50% of the time, at
which point it isn't worth it (for me) to move more Python into C, since I
could never buy myself more than a factor of 2 speedup.
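That ceiling is just Amdahl's law; a quick sketch of the arithmetic behind the "factor of 2" figure:

```python
def overall_speedup(optimized_fraction, factor):
    # optimized_fraction: share of runtime you can speed up
    # factor: how much faster that share becomes
    return 1.0 / ((1.0 - optimized_fraction) + optimized_fraction / factor)

# If the Python half of the runtime became infinitely fast, the whole
# program would still only run twice as fast.
print(overall_speedup(0.5, float("inf")))
print(overall_speedup(0.5, 10.0))
```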

------
bitwize
Or you could, you know, use C++: <http://stevehanov.ca/blog/index.php?id=95>

~~~
ehsanul
I wish that blog post was a joke, as well as some of the comments there.

------
anonjon
A couple of notes:

1.) A simple square macro will be faster than the expt function, because expt
does special calculations for dealing with decimal exponents. All you really
need to do is multiply it by itself. I think something like this:

    
    
      (defmacro square (expression)
        (let ((symb (gensym)))
          `(let ((,symb ,expression))
             (* ,symb ,symb))))
    

would work fine.

2.) You can turn degree-to-radian into a macro like so:

    
    
      (defmacro degree-to-radian (deg)
        `(the single-float (* ,deg ,(/ (coerce pi 'single-float) 180))))
    

And avoid a couple of function calls and some multiplication. This way you
also don't need to duplicate 'the single-float'...

On my machine, running clozure cl, this made it about 1/8th faster than the
original optimized version.

3.) I was kind of wondering, could you do this sort of computation ahead of
time for varying longitudes and latitudes, then store them in a table and do a
simple look up and interpolation, instead of doing a lot of trigonometry (is
this technically trigonometry?) at run time?

I'm not clear on the math exactly (whether it would work in this situation),
but that sort of thing is done using pre-computed tables with fairly complex
polynomial equations all of the time. In the case of a web app, it might be
better to do this sort of thing (ram is cheap anyway).
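The table idea in (3) is standard for expensive transcendental functions when modest accuracy suffices; here is a hedged sketch (the table size and test range are arbitrary) using a 4096-entry sine table with linear interpolation:

```python
import math

N = 4096
# precompute sin on an evenly spaced grid over one full period
TABLE = [math.sin(2 * math.pi * i / N) for i in range(N + 1)]

def table_sin(x):
    # reduce x to [0, 2*pi), then interpolate between the two
    # neighbouring table entries
    t = (x % (2 * math.pi)) / (2 * math.pi) * N
    i = int(t)
    frac = t - i
    return TABLE[i] * (1 - frac) + TABLE[i + 1] * frac

# worst-case error over a sample of points: roughly h^2/8, i.e. ~3e-7
err = max(abs(table_sin(x / 100) - math.sin(x / 100)) for x in range(1000))
print(err)
```

Whether the lookup beats a hardware sin depends heavily on the runtime, which is part of what this thread is arguing about.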

------
wendroid
> All the benchmarks were run 5x each on the same Debian machine

I refer the honourable gentleman to Zed Shaw's "Programmers Need To Learn
Statistics Or I Will Kill Them All"

<http://www.zedshaw.com/essays/programmer_stats.html>

~~~
smanek
I'm actually a mathematician by training ;-)

I ran all tests 5 times each and averaged the results. The results for each
test were within 2% of each other, and the machine wasn't used for anything
else while the tests were running.

I know it's not perfect, but I don't see any obvious sources of error ... I
did provide all the source, so you're welcome to verify my results.

~~~
Retric
It shows: you are comparing optimized LISP with unoptimized Java and C. Java
has a long startup process, so it is relatively faster on 20-second benchmarks
than sub-second ones. As for C, it does what you tell it to do, so the
compiler needs a little more handholding. I suspect LISP's compiler is not
actually doing the computations because they are never used, though granted I
have not really looked into it.

~~~
smanek
I don't count the startup time for any of the languages (Java included). And
the runtime for Java was over 2 minutes.

~~~
Retric
Wow, that is slow. Well, Java's pow function uses floats for the exponent, so
it would be a lot faster if you used a temporary variable and then multiplied
it by itself. Other than that I don't know; most of these math functions
should translate to a single ASM instruction, so I don't know why it's that
slow.

PS: I would probably inline the C function, but that's not a huge deal. My
point was each language has its own optimizations; the only way to compare
them is to write reasonably optimized programs in each of them and compare
that.

Edit: I think Java's sin functions are slower than the HW implementation but
more accurate, which is irrelevant here because you don't care, but it's
something to look into. I dislike Java for other reasons, but it's easy to
write slow code when code that looks very similar runs a lot faster.

~~~
jedbrown
1\. There is no C function.

2\. Inlining in an actual C implementation doesn't matter because there is no
register pressure and each trig function generates a real function call.
Function calls to known addresses are really fast (~2 cycles), even indirect
calls are fast compared to transcendental functions (~ 6 cycles). For
comparison, the C implementation takes ~260 cycles each distance computation.
This can't be improved much without vectorizing the trig functions.

~~~
Retric
1\. There is no C function.

What do you call _public static float distance(float latA, float lngA, float
latB, float lngB) {_

As I said, _but that's not a huge deal._ Yes, it's not a major slowdown, but a
tight inner loop with a function call is begging for an inline.

~~~
jedbrown
> import java.util.Calendar;

might be your first clue...

> but a tight inner loop with a function call is begging for an inline.

The loop body makes several other function calls, you would be very hard
pressed to measure the difference. And if it was a C function in a larger
project, you probably wouldn't want to inline it because that would slow your
compile times and cause you to lose modularity (i.e. you'd have to rebuild all
client code rather than just update the DSO when you optimize the distance
function). This function does enough work that it doesn't deserve to be
inlined.

~~~
Retric
Yep, sorry, I was not really paying attention and saw what I was expecting to
see.

