
Why Python, Ruby, and Javascript are Slow - jasonostrander
https://speakerdeck.com/alex/why-python-ruby-and-javascript-are-slow
======
DannyBee
Speaking as a compiler guy, and having had a hand in a few successful commercial
JITs: The only reason he thinks they aren't slow is that they haven't yet
reached the limit where making the JIT faster stops paying off versus making
the program faster. Yes, it's true that the languages are not slow in the sense
that better optimization strategies can take care of most situations. As a
compiler author, one can do things like profile types/trace/whatever, and
deoptimize if you get it wrong. You can do a lot. You can recognize idioms, use
different representations behind people's backs, etc.
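As a toy sketch of that profile-and-deoptimize idea in plain Python (all names here are invented for illustration; a real JIT does this on compiled code, not with closures):

```python
# Specialize for the types seen during profiling, guard on them at
# runtime, and fall back to a generic path when the guard fails.
def generic_add(x, y):
    return x + y

def make_specialized_add(sample_x, sample_y):
    expected = (type(sample_x), type(sample_y))
    def fast_add(x, y):
        if (type(x), type(y)) != expected:  # guard
            return generic_add(x, y)        # "deoptimize": generic fallback
        return x + y                        # specialized fast path
    return fast_add

add = make_specialized_add(1, 2)    # "profiled" on ints
assert add(3, 4) == 7               # fast path
assert add("a", "b") == "ab"        # guard fails, falls back cleanly
```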

But all those things take time that is not spent running your program. On
average, you can do pretty well. But it's still overhead. As you get farther
along in your JIT, optimization algorithms get trickier and trickier, and your
heuristics more complex. You will eventually hit the wall where optimizing
some code costs more time in the JIT than it saves in real work. This happens
to every single JIT, of course. This is why they try to
figure out which code to optimize. But even then, you may find there is too
much of it.

Because of this, the languages _are_ slower, it's just the overhead of better
JIT algorithms, not slower code. In practice, you hope that you can optimize
enough code well enough that nobody cares, because the ruby code takes 8ms,
and the C code takes 5ms.

For example: Almost all of the allocations and copying can be optimized, but
depending on the language, the algorithms to figure out what you can do safely
may be N^3.

Also, PyPy is still pretty young in its life cycle (in this iteration of
PyPy:P) for folks to say that they can make stuff much faster if they only had
a few things. It really needs a very large set of production apps being run by
a very large set of folks for quite a while to see where the real bottlenecks
still are. Past a certain point, you run out of optimization algorithm
bullets. The way compilers get the last 20% is by tuning the algorithms for 10
years.

Of course, I'm not trying to slag on PyPy, I think they've done an amazing job
of persevering through multiple rewrites to get somewhere that seems to be
quite good now. I just am a little wary of a fairly young JIT saying that all
big performance problems fall into a few categories.

~~~
sb
Interesting point of view. The problem in compiler construction is well known
("Proebsting's law", though it says it's more like 18 years instead of 10).

The issue with benchmarks is surely well known, also by the PyPy authors; I
wonder what the biggest application is that they have benchmarked or that runs
on PyPy.

Your point on the JIT compiler interrupting program execution is certainly
valid, too, but not necessarily so. One could easily do the code generation in
a separate background thread and let execution switch over only if necessary.
But, as you have already said, a _latency_ issue certainly exists. This is one
of the cases where interpreters usually have a leg up, and there are promising
ways of optimizing interpreters.

~~~
DannyBee
Yes, you could do a background thread, with some caveats:

1\. On most current CPUs, this will cause really bad cache/memory thrashing,
enough to probably impact the program.

2\. This may actually cause significant slowdown, depending on how long it
takes to optimize a given set of code (i.e. it may be better to spend 100ms
paused optimizing than 5000ms in the background). This is, of course, a
latency issue.

3\. State of the art for most JITs is still to use one thread. The number of
folks doing actual parallel code generation is nil. So sadly, even if you had
4 cores, 3 empty, you'll still, at best, get to use one of them for the
background thread doing the optimizing. There are parts that are trivial to
parallelize if you've structured the JIT "right", but they aren't always the
parts that are slow.

~~~
pcwalton
Background compilation in a separate thread actually works pretty well. IE9
has been shipping it with Chakra for a while, and Firefox is now getting it
(and it improved the benchmarks a lot, especially on ARM).

~~~
DannyBee
Good to hear it's gotten better. Admittedly, I wasn't thinking about browser
based JITs when I said that :)

I'm actually curious if you have any stats on how much of the time this is
being done on actual busy machines where it's going to compete for L1/etc
resources vs how often it's able to be offloaded onto an otherwise empty core.

I.e., I expect there to be a significant difference between the use cases for
JITs like PyPy, which are probably going to sit on shared servers that folks
are trying to maximize utilization of, and desktops, where I imagine most
browsing probably doesn't use all cores at 100%.

~~~
antrix
> Admittedly, I wasn't thinking about browser based JITs when I said that :)

Don't HotSpot and JRockit also do background (de)compilation & swapping of
generated code?

~~~
DannyBee
Yes, but in HotSpot's case I cannot remember if it is actually turned on in
both "server" and "client".

~~~
pjmlp
Aren't server and client now merged with tiered compilation in HotSpot?

~~~
DannyBee
No, AFAIK. "Tiered compilation, introduced in Java SE 7, brings client startup
speeds to the server VM. ... Tiered compilation is now the default mode for
the server VM. "

Again, AFAIK, the server VM still has a significantly different set of tuning
than the client VM. In particular, it runs some significantly more complex
opts that the client VM does not.

------
pcwalton
Related to this is the importance of deforestation. Some good links:

* <http://en.wikipedia.org/wiki/Deforestation_%28computer_science%29>

* <http://www.haskell.org/haskellwiki/Short_cut_fusion>

Deforestation is basically eliminating intermediate data structures, which is
similar to what the "int(s.split("-", 1)[1])" versus "atoi(strchr(s, '-') +
1)" slides are about. If you consider strings as just lists of characters,
then it's basically a deforestation problem: the goal is to eliminate all the
intermediate lists of lists that are constructed. (It's something of a
peculiar case though, because in order to transform into the C code you need
to not only observe that indexing an rvalue via [1] and throwing the rest away
means that the list doesn't have to be constructed at all, but you also need
to allow strings to share underlying buffer space—the latter optimization
isn't deforestation per se.)

I don't know if there's been much effort into deforestation optimizations for
dynamic languages, but perhaps this is an area that compilers and research
should be focusing on more.
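For the slide example, the hand-"deforested" Python looks something like this (a sketch only; note the slice still copies in CPython, since its strings don't share underlying buffers, which is exactly the second optimization mentioned above):

```python
s = "abc-123"

# Idiomatic version: split() builds a 2-element list of new strings,
# of which only one element is kept.
n1 = int(s.split("-", 1)[1])

# Hand-deforested version: find the separator index and slice directly,
# avoiding the throwaway list entirely.
n2 = int(s[s.find("-") + 1:])

assert n1 == n2 == 123
```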

On another minor note, I do think that the deck is a little too quick to
dismiss garbage collection as an irrelevant problem. For most server apps I'm
totally willing to believe that GC doesn't matter, but for interactive apps on
the client (think touch-sensitive mobile apps and games) where you have to
render each frame in under 16 ms, unpredictable latency starts to matter a
lot.

~~~
lucian1900
Deforestation is easily done in lazy languages like Haskell.

As for GC, it would be nice to have good real time GCs in runtimes.

~~~
klodolph
> As for GC, it would be nice to have good real time GCs in runtimes.

After decades of GC research, I think the conclusion is, "Yeah, that would be
nice." Current state of the art gives us some _very nice_ GCs that penalize
either throughput or predictability. One of my favorite stories about GC is
here:

<http://samsaffron.com/archive/2011/10/28/in-managed-code-we-trust-our-recent-battles-with-the-net-garbage-collector>

------
cschmidt
A nice talk. The punchline for me was:

    
    
        Things that take time
         •Hash table lookups
         •Allocations
         •Copying
    

Interestingly, that's exactly how you write fast C++ code. His point is that
languages like Python lack good APIs for preallocating memory.
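A tiny Python illustration of the allocation-and-copying point (a sketch only; the exact costs depend on the runtime's over-allocation strategy):

```python
n = 10000

# Allocation-heavy: each += may build a brand-new bytes object
# and copy everything accumulated so far.
out = b""
for i in range(n):
    out += b"x"

# Preallocated: one allocation up front, then in-place writes.
buf = bytearray(n)
for i in range(n):
    buf[i] = ord("x")

assert bytes(buf) == out
```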

~~~
kyllo
It's how you write fast algorithms in general, in any programming language.
Minimize the number of reads and writes per iteration/recursion.

In higher-level programming languages, it's just a bit harder to control the
number of reads and writes because you're working at several layers of
abstraction above them, and are concerned with solving higher-level problems.
Use the language that provides the appropriate level of abstraction for the
problem you're trying to solve.

~~~
PeterisP
On the other hand, why should the abstraction layers prevent that? I mean,
abstraction layers abstract away the [hopefully] unimportant low-level choices
from me - but "copy or not copy" or "allocate once or allocate thrice" isn't a
choice that I need to make anyway; the abstraction layer simply should make
the 'non-copy' choice for me. Exactly the same way that the C abstraction
layer right now makes the proper opcode-ordering choices for me as good (or
better) than I can do manually in assembler.

The problem is that we haven't yet implemented those abstraction layers in
this smart way - for example, Haskell can implement 'fusion' of multiple
string operations so that they are merged together and executed without
intermediate copies; and the abstraction layer for that is exactly as high-
level as the Python examples in original poster's slides. Sure, it's
objectively hard to change core Python like that - but it theoretically can be
done, so it should&will be done.
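In today's Python, a generator pipeline is the closest do-it-yourself analogue of that fusion (a sketch; the stages are interleaved into one pass with no intermediate lists materialized):

```python
data = ["Axb", "xcX", "ddx"]

# Eager: each stage builds a full intermediate list.
eager = [s.replace("x", "y") for s in [s.lower() for s in data]]

# Lazy: generator expressions "fuse" the stages, processing one
# element at a time without an intermediate list.
lazy = list(s.replace("x", "y") for s in (s.lower() for s in data))

assert eager == lazy == ["ayb", "ycy", "ddy"]
```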

~~~
metaphorm
It's not always possible to go with the "non-copy" choice. For example, there
are very good reasons for having immutable strings, and once you've made that
choice at the language level, every string function you write is going to copy
at least once.

I think Alex Gaynor is correct, and that basically what is wrong at the moment
is that dynamic languages lack APIs that have any sensitivity to performance
concerns. There's always going to be a hard limit based on the nature of using
a JIT vs. a static multi-pass compiler. There's always going to be a hard
limit based on fundamental language choices (implementations of primitives,
mutable vs. immutable strings, amount of overhead in object instantiation,
etc.). But we're nowhere near those limits right now.

~~~
PeterisP
Compilers can work around that choice - if you want to do something with
immutable strings (like the Haskell example I mentioned, it does have
immutable strings) then you will have to make some copy, but you don't need to
make a copy per every function - if you're stringing three string functions in
a row, the compiler can "fuse" the processing so that only a single, final
copy is made, not the intermediate ones.

For any language the compiler may know which variables won't ever be used -
for example in pseudocode

    
    
      b = a.lowercase()
      c = b.replace("x","y")
      d = a.lowercase().replace("x","y")
    

both 'b' and the intermediate result in 'd' are strings, but the compiler can
flag these two 'throw-away' variables as mutable strings (while still
maintaining the promise that all programmer-visible strings will be
immutable); and you may have a special version of 'replace' standard function
that does no-copy, in-place replacement in such cases. It means extra work in
building API/stdlib, but brings better performance for the same programs.
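A toy illustration of that hidden-mutation idea in plain Python, with bytearray standing in for the hypothetical compiler-internal mutable string (all names invented):

```python
# Hypothetical in-place 'replace' on a throwaway intermediate buffer.
def replace_inplace(buf, old, new):
    # single-byte old/new only, purely for illustration
    for i, b in enumerate(buf):
        if b == old:
            buf[i] = new
    return buf

a = "XzXz"
tmp = bytearray(a.lower(), "ascii")   # the compiler-flagged throwaway
d = replace_inplace(tmp, ord("x"), ord("y")).decode("ascii")

assert d == "yzyz"
assert a == "XzXz"   # the programmer-visible string stays immutable
```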

------
njharman
Meh, MEH.

I'm almost never waiting on my python code. I'm waiting on network or disk or
database or joe to check in his changes or etc.

I'm sure there are people who do wait. But that's why numpy, c extensions, all
the pypy, psycho, and similar things exist.

Python and more broadly "scripting" languages are for speed of development.
Something else can take on speed of execution faster than 90% of people need
it to be.

~~~
tptacek
It's not an unresolved question whether idiomatic Python is slower than
idiomatic C/C++ for solving comparable problems. Python is much, much slower
than C.

~~~
timtadh
This is completely true. It is indeed well known and common wisdom.

However, I think the point the parent was trying to make is: Python is much
slower than C and many other languages, however most of the time speed is
unimportant. When it becomes important, there are many technologies to
mitigate the problem in your "hot loops."

If speed is your primary concern, don't use Python et al. If it isn't your
main concern, go ahead; it probably won't become an issue, and if it does, you
probably will be able to get around it.

~~~
tptacek
The deck is about the performance of the language.

~~~
timtadh
@chadcf and @tptacek

I was responding to @tptacek's criticism of the parent, not the deck. The deck is
great and it mirrors the wisdom I have picked up from optimizing my own code
over the years. I personally find it really frustrating not being able to
easily pre-alloc lists in Python. I think that having better APIs would go a
long way.

As the deck says:

"Line for line these languages _are_ fast!"

"We need better no-copy/preallocate APIs"

"Take care in data structures"

~~~
mixmastamyk
Forgive the naive question, but why not:

    
    
        l = [object()] * 100
    

Perhaps the difference is stack vs. heap?

~~~
stevejohnson
That will create a list of 100 references to the same object:

    
    
      class Obj(object): pass
      l = [Obj()] * 100
      l[0].x = 1
      print l[1].x
      > 1
    

Edit: On second read, it looks like you're asking something other than what I
thought you were asking. Yes, you could create a list of 100 items and then
replace its elements, but that's not idiomatic.

~~~
metaphorm
but it is idiomatic in C, which is the point of the slide. C was built around
a performance focused idiom, which is to pre-allocate memory and then do in
place writes and swaps to mutate the buffer to the state you need it to be.
Python is built around an idiom of largely creating copies of objects and
appending them to dynamically allocated lists. It's a much slower idiom.

~~~
mixmastamyk
Yes, the question was regarding this line in the original post: "I personally
find it really frustrating not being able to easily pre-alloc lists in
Python."

So my mind of course wandered in the direction of how to do that.
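A small sketch of the distinction, for the record (illustrative only):

```python
class Obj(object):
    pass

n = 100

# [Obj()] * n would repeat one instance n times; a comprehension
# creates n distinct instances.
objs = [Obj() for _ in range(n)]

# C-style pre-allocation: allocate the slots once, write in place.
slots = [None] * n
for i in range(n):
    slots[i] = i * i

assert objs[0] is not objs[1]
assert slots[3] == 9
```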

------
defen
Back when I wanted to investigate the numeric performance of v8 I wrote a
Runge-Kutta integrator + Lorenz attractor in C and in JavaScript as a simple-
but-not-entirely-trivial benchmark. I was actually pretty impressed with how
fast the v8 version was. On the downside, it's fairly non-idiomatic js and not
that much nicer to look at than the C. Doing a million steps on my machine
takes 0.65 seconds in node.js v0.8.4, 0.41 seconds in C compiled with gcc -O0,
and 0.13 seconds with gcc -O3. Here is the code if anyone is interested. Note
that it's not commented, not thread-safe, and doesn't free memory, so use at
your own risk :)

<https://gist.github.com/anonymous/5066486>

    
    
        gcc strange.c rk4.c; ./a.out
    
        node strange.js

~~~
masklinn
> Back when I wanted to investigate the numeric performance of v8

Straightforward numerical computation really isn't a good JIT benchmark,
because numerical computations are _by far_ the easiest thing to JIT, and
JITted perf is going to be much closer to AOT than in the general case
(unless the problem can be vectorized and the AOT compiler is vectorizing; I
don't think JITs can usually vectorize).

~~~
defen
Yup...I wanted to try to quickly measure just how good the JIT was for that
kind of stuff, to see if we were getting to the point where it's feasible to
do physics in the browser. Turns out it's "fast enough", and yet still 5x
slower than equivalent stock C.

------
moreati
Great presentation, thank you for making me aware of an aspect of Python
performance. One slide struck me as odd - the "basically pythonic" squares()
function. I understand it's a chosen example to illustrate a point, I just
hope people aren't writing loops like that. You inspired me to measure it

    
    
        $ cat squares.py
        def squares_append(n):
            sq = []
            for i in xrange(n):
                sq.append(i*i)
            return sq
    
        def squares_comprehension(n):
            return [i*i for i in xrange(n)]
        $ PYTHONPATH=. python -m timeit -s "from squares import squares_append" "squares_append(1000)"
        10000 loops, best of 3: 148 usec per loop
        $ PYTHONPATH=. python -m timeit -s "from squares import squares_comprehension" "squares_comprehension(1000)"
        10000 loops, best of 3: 74.1 usec per loop
        $ PYTHONPATH=. pypy -m timeit -s "from squares import squares_append" "squares_append(1000)"
        10000 loops, best of 3: 46.9 usec per loop
        $ PYTHONPATH=. pypy -m timeit -s "from squares import squares_comprehension" "squares_comprehension(1000)"
        100000 loops, best of 3: 8.67 usec per loop
    

I'm curious to know how many allocations/copies a list comprehension saves in
CPython/PyPy. However I wouldn't begin to know how to measure it.
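One rough way to get at this in CPython is the stdlib `tracemalloc` module (Python 3.4+); it tracks Python-level allocations, though not the internal `realloc` copies a growing list performs:

```python
import tracemalloc

def squares_append(n):
    sq = []
    for i in range(n):
        sq.append(i * i)
    return sq

tracemalloc.start()
squares_append(1000)
current, peak = tracemalloc.get_traced_memory()  # bytes (now, high-water)
tracemalloc.stop()

# The list and its int elements were allocated while tracing.
assert peak > 0
```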

~~~
ericmoritz
If you really want power, use NumPy:

    
    
        from numpy import arange
    
        def squares_numpy(n):
            a = arange(n)
            return a * a
    
        $ python -m timeit -s "from squares import squares_append" "squares_append(1000)"
        10000 loops, best of 3: 130 usec per loop
        $ python -m timeit -s "from squares import squares_comprehension" "squares_comprehension(1000)"
        10000 loops, best of 3: 95.4 usec per loop
        $ python -m timeit -s "from squares import squares_numpy" "squares_numpy(1000)"
        100000 loops, best of 3: 5.31 usec per loop

------
Zak
The creators of Common Lisp knew what Alex is talking about. Lisp is, of
course, just as dynamic as Ruby, Python or Javascript, but it exposes lower-
level details about data structures and memory allocation iff the programmer
wants them.

Features that come to mind include preallocated vectors (fixed-size or
growable), non-consing versions of the standard list functions and the ability
to bang on most any piece of data in place. There are fairly few situations in
which a CL program can't come within a factor of 2 or 3 of the performance of
C.

~~~
pjmlp
In the early 80's there was a time Lisp compilers could even beat FORTRAN for
floating point computations.

<http://www.cs.berkeley.edu/~fateman/papers/lispfloat.pdf>

~~~
cliffbean
From the discussion of their most relevant benchmark (Singular value
decomposition):

    
    
      The Allegro CL 4.1 times of 3.9 seconds beat the f77 time of 4.8 [seconds];
    

Nice!

    
    
      setting on optimization for f77 brought its time down to 0.45 seconds.
    
      Thus for this system, the [LISP] compiled code can have quite comparable speed to
      that of the corresponding unoptimized Fortran in this case as well.
    

Oh really.

~~~
pjmlp
It's been a few decades since I have read the paper, so it seems my memory is
a bit fuzzy on that regard.

On the other hand, given that C always had issues to beat even unoptimized
Fortran, due to the optimization restrictions before C99, it is quite
commendable that Lisp achieves such results.

~~~
cliffbean
It's not obvious how optimized C in the '80s could have failed to be as fast
as _unoptimized_ Fortran 77, even with whatever optimization restriction you
might be thinking of.

The way the authors of that paper talk about unoptimized code in that paper
gives the impression that they don't know what they're talking about. Your
comments here begin to put you at risk of a similar appearance.

------
wheaties
Great bit of slides. Straight and to the point. If you've ever ventured under
the hood of Python you'd see this in the code. If you've ever had to optimize
the bejeesus out of code in C++ or C, you'd know exactly the kinds of things
he's talking about.

------
kingkilr
Author/speaker here:

I don't have time to read all the comments now (thanks for all the interest
though!). I just want to say I think when the video comes out it'll answer a
lot of questions people are having.

~~~
jholman
I'm looking forward to the video. I'm also interested in proof of the "lame
myths" claims, or links to debunkings of those myths, etc. And also if you
have rants about those If There's Time topics in your last slide, I'd like to
read those too. Thanks!

------
meunier
Someone actually posting notes with slides! It's a miracle!

~~~
NateDad
yes thank god. I almost skipped it when I saw it was a slide deck, until I saw
the notes. I hate it when people link to a slide deck with no notes. It's
almost completely useless.

------
riobard
Completely agree. APIs are so important for many optimizations to pull off.

I'd really like to use a lot more buffer()/memoryview() objects in Python.
Unfortunately many APIs (e.g. sockets) won't work well with them (at least in
Python 2.x. Not sure about 3.x).

So we ended up with tons of unnecessary allocation and copying all over the
place. So sad.

------
dicroce
As a C/C++ programmer I find these slides kind of amusing... These languages
are popular because they make things simpler, and his suggestions may very
well get a nicely jit'd language on par with C, but I suspect you'll then have
the same problems C does (complexity).

~~~
mistercow
I don't think that added complexity invalidates the usefulness of high
performance APIs in high level languages. The point would not be to write
_all_ of your code to be highly performant (that would be premature
optimization), but to optimize the hot spots.

Currently, if you want to really optimize a hot spot in, say, Python, your
only real option is to write that part in C. Then you have all the additional
complexity of gluing that into your Python program, along with portability
concerns and a more complex build process. It would be _so_ much easier if
there were a way to sacrifice local simplicity and idiom for performance while
still staying in the language.

And in the case of JS, I'm not sure you have much in the way of options at all
for optimizing hot spots to reduce allocations. Maybe you could write it in C
and compile it to JS via emscripten? I don't know if that would even help
currently, but maybe if asm.js takes off. But once again, wouldn't you rather
sacrifice a _small_ amount of elegance for performance rather than switching
languages?

------
CJefferson
One main thought on this topic -- languages like Haskell and lisp also have
very poor support for direct memory control, but tend to be viewed (perhaps
untruthfully?) as much closer in performance to C than Python/Ruby.

~~~
andolanra
Haskell and languages in the ML family have a lot of opportunities for
elaborate static analysis, which often allows the compiler to be quite clever
about optimizing the resulting programs. As one example, the GHC Haskell
compiler uses loop fusion to combine multiple passes over a list into a
single pass with no intermediate copies of the list produced. Consequently,
Haskell code like

    
    
        map f (map g (map h someList))
    

is going to involve allocating exactly one list of the same size as someList,
while a direct translation into Python

    
    
        map(f, map(g, map(h, someList)))
    

is going to involve the creation of several intermediate lists.

~~~
koenigdavidmj
In the Haskell case you could also do something like this to avoid needing
that optimisation:

    
    
      map (f . g . h) someList
    

And in Python 3, map returns an iterator, not a list, so you aren't building
the full list until you ask for it, and you never build intermediate lists in
your example. You can do the same thing in Python 2 with the itertools.imap
function.
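Both points can be sketched in modern Python 3 (where `map` is lazy), composing the functions versus chaining maps:

```python
f = lambda x: x + 1
g = lambda x: x * 2
h = lambda x: x - 3

some_list = [5, 6, 7]

# Composing by hand: one pass, no intermediate lists.
composed = [f(g(h(x))) for x in some_list]

# Chained maps: in Python 3 each map is a lazy iterator, so nothing
# intermediate is materialized until the final list() call.
chained = list(map(f, map(g, map(h, some_list))))

assert composed == chained == [5, 7, 9]
```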

~~~
dons
`map` is the trivial case. There are plenty of loop compositions that are
completely non-trivial to do by hand. That's why array fusion (see e.g. repa
or vector) is a huge win.

    
    
        foldl g . scanl y . concatMap x . filter h . unfoldr k
    

Fuse that by hand.

This is why we have optimizing compilers. They do what you could have done,
only more often, and without mistakes.

------
revelation
Looking at CPython and the bytecode it uses, it's not very hard to see why it
would be slow. It's basically designed as a reference implementation, with
only very tame optimizations.

------
estavaro
My own piece of feedback, based on my experience: the slides were good. But,
as others have noted, JIT is not all rosy. In V8 and Dart and .NET, code gets
compiled to native code as soon as possible. I think that's the best-case
scenario in general. You then don't have to guess as much.

The author didn't mention method dispatching. I think it's an issue for many
languages. In Dart, they tried to optimize it in the specification, by mostly
eliminating the need to change methods at runtime. In Ruby, I watched a video
by one of the core Ruby developers and he said that in Ruby method dispatching
can be very complicated requiring up to 20 steps to resolve them.

As important as getting the best performance out of programs is getting the
programs created in the first place. That's why I'm against shying away from
larger codebases. I'm in favor of OO programming exactly because I think
getting things done comes first, even if that could complicate the
implementation of the toolset. And OO is all about layers of abstractions that
bring more performance costs with them.

That said, I absolutely abhor type annotations. They make code hideous and
decrease the opportunities for experimentation. Instead of reading a + b = c
algorithms, you may need to parse A a + B b = C c source code.

In Dart we have Optional Types. But the core developers are fond of type
annotations, so most samples they post come with them. I take relief in being
able to omit type annotations while experimenting, researching and ultimately
prototyping. Although in a way I feel like a rebel in the community for this
disregard. Thankfully there is this chance to share a community with them.

Reading the part about how you don't like adding heuristics to help programs
go faster reminded me of adding types to them, even if they are mostly
disregarded as in Dart.

Then again, not all "dynamic languages" are the same. Some are truly dynamic
with eval and runtime method changes. Others, not so much. Sometimes the
tradeoffs allow for other kinds of gains that could come into play like when
deploying. So there is a lot more to it than just getting the algorithms
correct.

------
csense
The example he gives for strings could be optimized to near the efficiency of
the C version by a sufficiently smart compiler:

    
    
        int(s.split("-", 1)[1])
    

If the JIT knows that s is the builtin string type and the split() method has
not been overridden [1], it can speed this up by using "pseudo-strings," where
a pseudo-string is an index and length into another string. This would require
only O(1) time and space.

Garbage-collecting pseudo-strings would be an interesting exercise, but I'm
sure it's a solvable problem [2] [3].

[1] If the preconditions for your optimization don't hold, you can always fall
back to interpreting it. As noted by the speaker, this sort of logic is
already a critical part of many JITs, including PyPy.

[2] The problem is actually GC'ing the parent. When the parent string is
gc'ed, you have to compact the orphan strings to reclaim the remaining space;
otherwise it'll be possible to write user code that uses a small finite amount
of memory in CPython but has an unbounded memory leak in your compiler.

[3] You can avoid the trickiness in [2] if the parent string can be proven to
outlive its children, which is the case in this example. You could probably
optimize a lot of real-world code, and have an easier time implementing the
compiler, if you only used pseudo-strings when they could be proven to be
shorter-lived than the parent. As a bonus, this partial GC would build some
infrastructure that could be recycled in a general implementation.
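A toy sketch of what such a pseudo-string could look like (all names invented; a real JIT would do this at the representation level, invisibly to user code):

```python
# An (owner, start, length) view over another string: O(1) to create,
# no copying until the value is actually materialized.
class PseudoStr(object):
    def __init__(self, owner, start, length):
        self.owner, self.start, self.length = owner, start, length

    def materialize(self):
        # Copying is deferred to the point of actual use.
        return self.owner[self.start:self.start + self.length]

def split_after(s, sep):
    i = s.index(sep) + len(sep)
    return PseudoStr(s, i, len(s) - i)   # O(1) time and space

view = split_after("abc-123", "-")
assert view.materialize() == "123"
assert int(view.materialize()) == 123
```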

------
cheald
Kind of a poorly-named deck. It's really about why programs use features of
these languages that end up causing poor performance relative to C, rather
than why the individual VMs themselves are slow. It's no surprise that trading
the byte-precision of C for the convenience of a garbage collector and heap-
allocated data structures results in a performance decrease.

Dynamically-typed languages are often easier to program in, but require more
copying (and memory allocation) as a result. Hash tables are heap-allocated
and have to be garbage collected, but they're flexible - something you don't
get with structs. Allocating and freeing memory has a cost, and that can add
up quickly. Your primary line of optimization in most of these languages is
"avoid the GC", which really boils down to "don't allocate more than you need
to", which is sound advice in every language, scripting or otherwise.

~~~
tptacek
Did you read the deck? The GC isn't the problem; it's layout and management of
allocations that's the problem, whether you use a garbage collector or
explicit deallocation to clean up the resulting mess.

I think the idea that GC is what slows down dynamic languages has to be the
most prevalent misconception about language performance.

~~~
jholman
Yeah. I dunno about "most prevalent", though. TFA contained two "lame
excuses", both of which are things I believed to be causes of slowness in
dynamic languages, and now I have to reconsider.

    
    
        dynamic typing prevents type-based optimization
        monkey patching prevents optimization
    

I think the most common complaint I hear about GC is not that it affects
computational throughput, but that it affects _predictability_ of
computational throughput. One maybe doesn't care in scientific computing, but
game developers are always going on about how they can't use a GC language
because a stall mid-frame will knock them over 16ms/frame or 33ms/frame, which
for console certification is a project-killer.

~~~
cheald
It's worth noting that WoW uses Lua; at some point Blizzard switched from Lua
5.0 (which used a stop-the-world GC) to 5.1 (which introduced an incremental
GC) specifically because of this problem. Before, UI code could generate
excess tables, kick in the GC, and dramatically impact your framerate. The
incremental GC significantly helped this, since it permits the GC to be run
over multiple frames, reducing or even eliminating the perceptible impact on
the user experience.

------
gingerlime
Interesting slides, and good point about having better APIs.

Perhaps I'm nitpicking, but with a function called `newlist_hint`, I struggle
to see how anybody would adopt it. I had to go back to the slides maybe 3
times, and I still don't remember the name of this function... Those APIs must
have the most obvious, logical and simple names.

------
wting
I have a few comments about some of the slides, feel free to correct any
misunderstandings.

Dictionary vs Object:

Lookups in both data structures are O(1), the difference being the hashing
cost (and an additional memory lookup for the heap) vs a single memory lookup
on the stack (1 line of assembly).

Squares list:

> ... so every iteration through the list we have the potential need to size
> the list and copy all the data.

This is no different than stl::vector which has an amortized cost of O(1) for
a push_back().

It's not going to be as fast as C, but I'd also argue for a generator version
instead:

    
    
        def squares(n):
            return (i*i for i in xrange(n))
    

One of the main reasons people choose Python is for expressiveness and _not_
manually managing memory, although pre-allocation does seem like a good idea.

~~~
kragen
You mean std::vector, from the STL. And yes, the amortized cost is O(1) per
element and thus O(N) in total, but the constant factor and lower-order terms
(the O(1) time to do the allocation and garbage-collect it later) do matter.

------
oscargrouch
It's time to face it:

People started creating computer languages without caring too much about the
target processor's opcodes (because by then processors were just getting
faster over time) and focused more on programmer convenience, and wild beasts
like Python and Ruby were born..

C is fast because it was created with processor awareness in mind.. pretty
simple...

these days kids are all about creating more and more crappy convenient-syntax
languages.. and they get worried when the languages don't scale? which
computer did they design the language for? one from Venus?

nobody should be doing any serious software in Python or Ruby.. it's such a
waste of talent .. use them for education.. for fun.. or for the things they
are best at.. which is not the system/plumbing side of things

~~~
dmg8
Please don't _ever_ comment again. Thanks in advance.

------
coldtea
Speak for Python and Ruby.

Javascript is insanely fast, with V8 and its ilk.

And I'm not talking about "toy benchmarks" either, I'm talking about involved
stuff written in plain JS (no C extensions), from the Qt port to JS/Canvas, to
the h264 encoder and such. Try doing those on Python and you'll see what you
get. And of course all the toy benchmarks also agree.

Javascript with v8 is like a faster PyPy (with less performance deviation): 10
to 20 times faster than plain Python code.

Sure, you can extend Python with fast C code. But as the core languages are
concerned, JS beats CPython hands down. (Oh, and you can also extend JS with
fast C/C++ code if you need that. Node modules do it all the time).

------
mixmastamyk
Question:

    
    
        def squares(n):
            sq = []
            for i in xrange(n):
                sq.append(i*i)
            return sq
    
        A basically idiomatic version of the same in Python. No list
        pre-allocation, so every iteration through the list we have the
        potential to need to resize the list and copy all the data. That's
        inefficient.
    

Is that true? I'd expect .append() to change a pointer or two, not "resize and
copy" the list. Even an .insert() should just move pointers at the C-level...
no need to "defrag" it. I guess the key word is _potential_.

~~~
doktrin
Being fairly new to C, is appending to / dynamically growing an array really
just a matter of "a pointer or two"?

How can you take for granted the memory space past the end pointer is
available?

~~~
jholman
On an array, you can't. This means that you can't on a Python list, either.
mixmastamyk is mistaken about the implementation details.

But if you assume that "list" means "linked list", then you can just navigate
to the correct part of the list, allocate enough space for one new cell, and
stitch together a few pointers. Allocation and stitching are O(1). In general,
navigating to part of the list is O(n), but if your list is a doubly-linked
circular linked-list, or alternately if you keep a pointer to the end as a
special case, then "navigate to the end of the list" becomes also O(1). I
assume that all of this is what mixmastamyk was thinking Python was doing.

~~~
doktrin
Thanks for the detailed response.

I was in fact taking it as almost a given that Python lists were backed by
arrays under the hood.

------
irahul
Mike Pall of luajit fame has an interesting take on it.

[http://www.reddit.com/r/programming/comments/19gv4c/why_pyth...](http://www.reddit.com/r/programming/comments/19gv4c/why_python_ruby_and_js_are_slow/c8nyejd)

<quote>

While I agree with the first part ("excuses"), the "hard" things mentioned in
the second part are a) not that hard and b) solved issues (just not in PyPy).

Hash tables: Both v8 and LuaJIT manage to specialize hash table lookups and
bring them to similar performance as C structs (1). Interestingly, with very
different approaches. So there's little reason NOT to use objects,
dictionaries, tables, maps or whatever it's called in your favorite language.

(1) If you really, really care about the last 10% or direct interoperability
with C, LuaJIT offers native C structs via its FFI. And PyPy has inherited the
FFI design, so they should be able to get the same performance someday. I'm
sure v8 has something to offer for that, too.

Allocations: LuaJIT has allocation sinking, which is able to eliminate the
mentioned temporary allocations. Incidentally, the link shows how that's done
for a x,y,z point class! And it works the same for ALL cases: arrays {1,2,3}
(on top of a generic table), hash tables {x=1,y=2,z=3} or FFI C structs.

String handling: Same as above -- a buffer is just a temporary allocation and
can be sunk, too. Provided the stores (copies) are eliminated first. The
extracted parts can be forwarded to the integer conversion from the original
string. Then all copies and references are dead and the allocation itself can
be eliminated. LuaJIT will get all of that string handling extravaganza with
the v2.1 branch -- parts of the new buffer handling are already in the git
repo. I'm sure the v8 guys have something up their sleeves, too.

I/O read buffer: Same reasoning. The read creates a temporary buffer which is
lazily interned to a string, ditto for the lstrip. The interning is sunk, the
copies are sunk, the buffer is sunk (the innermost buffer is reused). This
turns it into something very similar to the C code.

Pre-sizing aggregates: The size info can be backpropagated to the aggregate
creation from scalar evolution analysis. SCEV is already in LuaJIT (for ABC
elimination). I ditched the experimental backprop algorithm for 2.0, since I
had to get the release out. Will be resurrected in 2.1.

Missing APIs: All of the above examples show you don't really need to define
new APIs to get the desired performance. Yes, there's a case for when you need
low-level data structures -- and that's why higher-level languages should have
a good FFI. I don't think you need to burden the language itself with these
issues.

Heuristics: Well, that's what those compiler textbooks don't tell you: VMs and
compilers are 90% heuristics. Better deal with it rather than fight it.

tl;dr: The reason why X is slow, is because X's implementation is slow,
unoptimized or untuned. Language design just influences how hard it is to make
up for it. There are no excuses.

</quote>

Also interesting is his research on allocation sinking:

<http://wiki.luajit.org/Allocation-Sinking-Optimization>

~~~
ksec
Yeah, I wish someone could crowdfund a project so Mike could spare 20% of his
time on the Ruby VM.

------
jdhuang
Interesting presentation, but it can't be the whole story. Even projects like
SciPy which use the most rudimentary data structures (basically just a large
array of floats) and algorithms (sometimes just looping through the elements
in order a few times) see a considerable advantage when rewritten in C.

<http://www.scipy.org/PerformancePython>

------
edanm
Very interesting talk.

Leads me to wonder - has anyone done a study of any large-scale program to
check where the slow spots are? It's not that I don't trust the speaker, he
makes excellent points and is obviously a great member of the community.

But it would be very interesting if he were able to say: "Using PyPy's secret
'hint' API, only in drop-dead obvious places, improved performance by a factor
of 5".

------
d0mine

        atoi(strchr(s, '-') + 1)
    

_What does this do? Finds the first instance of a -, and converts the
remainder of a string to an int. 0 allocations, 0 copies. Doing this with 0
copies is pretty much impossible in Python, and probably in ruby and
Javascript too._

The copying could be avoided in non-idiomatic Python:

    
    
        int(buffer(s, s.find("-") + 1))
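(Worth noting: `buffer` is Python 2 only. A rough Python 3 analogue, sketched
below with a made-up example string, uses `memoryview`; the slice itself is
zero-copy, but `int()` still wants a real `bytes` object, so one copy remains
at the end.)

```python
# Python 3 sketch: memoryview slicing avoids copying the substring,
# but int() needs a real bytes object, so bytes() makes the one copy.
s = b"id-42"                              # made-up example input
tail = memoryview(s)[s.index(b"-") + 1:]  # a view into s, no copy yet
value = int(bytes(tail))                  # the single remaining copy
assert value == 42
```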

~~~
ricardobeat

       +s.substr(s.indexOf('-') + 1)

------
arocks
It is about time people stopped referring to languages as fast or slow. It is
an implementation that is fast or slow, not a language.

~~~
systematical
I think it's time people stopped using shitty slide decks to get their point
across.

~~~
aroberge
Didn't it occur to you that these slides were for a presentation, and that
sharing them enables more people than just those who attended the presentation
to be informed of their content?

I, for one, am very grateful to speakers that make the extra effort required
to share with a larger group what they have already shared (or are about to
share) with a smaller group.

~~~
alxp
Much better to turn your notes from the talk into an article than to just
throw up the slides.

~~~
koide
But much harder. Slides (with notes!) are a VERY good compromise.

Remember: the author does not owe you anything.

------
bithive123
If you want to learn more about what the Ruby VM has to do in order to execute
your code, and some of the performance challenges for Ruby implementors (such
as its extremely flexible parameter parsing), I suggest this talk by Koichi
Sasada: <http://www.youtube.com/watch?v=lWIP4nsKIMU>

~~~
chadcf
I really wish this talk had been given by someone who spoke English, I found
it rather painful to try to get through...

Any good articles that summarize this info?

~~~
StavrosK
I find this attitude a bit entitled. The speaker does speak English, his
accent is just not very understandable.

However, opening the video and seeking to a random point, I must say that the
phrase "Ruby release policy: Ruby level compatibility" isn't doing any
Japanese speaker a favour.

~~~
jholman
You find his attitude entitled, I find your reply needlessly confrontational.

It's not like chadcf said "This talk is bullshit, that guy doesn't even speak
English". chadcf said "I experienced this difficulty, I wish that the
following thing existed, can anyone help me?" Maybe he could afford to have
done

    
    
        s/speak English/speak more fluent English/

------
kristianp
As a Ruby lover, I'm interested in the Ruby implementation the author wrote
and mentioned, Topaz [1]. Has anyone here tried it?

"Topaz is a high performance implementation of the Ruby programming language,
written in Python on top of RPython (the toolchain that powers PyPy)."

[1] <http://docs.topazruby.com/en/latest/>

------
jderick
I think the preallocate APIs sound like a cool idea. Perhaps there could also
be some kind of 'my hashtable is an object' hint that could let the compiler
do the same kind of optimizations on hashtables that it does on objects
(assuming that your hash keys don't change much).
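One way to see the idea in plain Python (the names below are invented for
illustration): when a dict's keys are stable, you can trade it for a
fixed-layout record by hand, which is roughly the transformation such a hint
would let the VM apply automatically.

```python
from collections import namedtuple

# Hand-rolled "hashtable as object": if the keys are stable, swap the
# dict for a fixed-layout record so lookups become attribute accesses.
point = {"x": 1, "y": 2, "z": 3}          # invented example data
PointRecord = namedtuple("PointRecord", sorted(point))
p = PointRecord(**point)
assert (p.x, p.y, p.z) == (1, 2, 3)
```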

------
ippa
His suggestion for better preallocate APIs made me think of this Ruby patch
from Charles Nutter: <http://www.ruby-forum.com/topic/173802>

4 years later and they still discuss it, heh.

------
Nate75Sanders
He mentions that he couldn't find a pure C hash table.

<http://linux.die.net/man/3/hcreate>

~~~
SourPatch
Back in the early 2000s it seemed like a lot of people were using kazlib:
<http://www.kylheku.com/~kaz/kazlib.html>

------
lsiebert
As a programmer that first learned c and still thinks like a C programmer in a
lot of ways, this actually explains a lot to me.

------
rjzzleep
I'm actually surprised no one ever talks about Perl. Isn't Perl crazy fast
compared to the other interpreted languages?

~~~
metaphorm
No. Perl is mildly faster, not crazy faster; same order of magnitude.

------
rasmusfabbe
This is misleading and contains errors like calling C++ "C". Unless you have a
great deal of knowledge about these things already, I urge you not to learn
from this but read the slides purely for entertainment.

Question: The author claims to be a compiler author. After some digging I
haven't found any information on what compilers he has written or are part of
writing. Could someone point me to the compiler(s) Alex is involved with?
Thanks.

~~~
cschmidt
He mentioned it directly in the talk: PyPy. Or are you being snarky, saying he
doesn't know what he's talking about because PyPy isn't a "real" compiler?
Alex has made huge contributions to several open source projects. I can't
imagine too many people who know more about making Python go fast.

