Hacker News new | comments | show | ask | jobs | submit login
Why is Python slow (kevmod.com)
279 points by zbb 359 days ago | hide | past | web | 212 comments | favorite



Database died. Google cache: http://webcache.googleusercontent.com/search?q=cache:5V0TMa0...

The gist of it is:

* Python spends almost all of its time in the C runtime

This means that it doesn't really matter how quickly you execute the "Python" part of Python. Another way of saying this is that Python opcodes are very complex, and the cost of executing them dwarfs the cost of dispatching them. Another analogy I give is that executing Python is more similar to rendering HTML than it is to executing JS -- it's more of a description of what the runtime should do rather than an explicit step-by-step account of how to do it.

Pyston's performance improvements come from speeding up the C code, not the Python code. When people say "why doesn't Pyston use [insert favorite JIT technique here]", my question is whether that technique would help speed up C code. I think this is the most fundamental misconception about Python performance: we spend our energy trying to JIT C code, not Python code. This is also why I am not very interested in running Python on pre-existing VMs, since that will only exacerbate the problem in order to fix something that isn't really broken.


> This means that it doesn't really matter how quickly you execute the "Python" part of Python. Another way of saying this is that Python opcodes are very complex, and the cost of executing them dwarfs the cost of dispatching them.

That doesn't really explain why Python is slow. Your just explaining how Python works. Why should C code be slow? Usually it is fast. Just saying the opcodes are complex doesn't really help, because if a complex opcode takes a long time, it is usually because it is doing a great deal.

Java used to have the opposite problem. It was doing too much at the "Java bytecode" level, such as string manipulation - so they added more "complex" opcodes written in C/C++ to speed things up, significantly.

What you really need to explain is why Python is inefficient. Bloated data structures and pointer hopping for simple things like adding two numbers may be a big reason. I know Perl had many efficiencies built in, and was considered quite fast at some point (90s?).


> What you really need to explain is why Python is inefficient.

Python is extremely dynamic and this makes things hard for someone who wants to build a JIT.

The powerful bits of python metaprogramming makes it really impossible for a JIT to say with some certainty across all running threads that what it is doing is right.

Inlining a simple call like a.x() is rather hard when everything underneath can move around - I am not saying that it always does, but implementing a python variant which is nearly the same isn't very useful.

Compare this to PHP, which has fixed method calls (unless you use runkit, which you shouldn't) - a->x() will always be the same method as long as there was an a->x which was valid.

The method will never change once it is has been validated.

However unlike Java, both languages end up not knowing exactly what type "a" will be when the method is being called.

Java also doesn't quite know, but only when the invoke is via an interface. But the engine at least knows exactly how many impls of that interface has been loaded so far (and the bi-morphic case is commonly 1 real impl and 1 mock impl).

But both in case of PHP and Python, the whole idea of "which object do I have look up ::x() for?" is an unknown. In PHP's case, you have to look it up once per class encountered and in Python's case, you have to verify someone hasn't replaced it at runtime.

There are very nice functional ways around this problem at the bottom end of numeric loops for Python, which makes it great for numeric processing interleaved with generic control flow.

numpy + numba is a great way of limiting all this and getting performance out of a simple loop. And I'd rather use numpy + a python script doing regexes rather than a C program + LAPACK.

But that performance doesn't translate over when you have class/object oriented structures or in general, just multi-threaded web/rpc style code.


JavaScript has the same problems, and that hasn't stopped every major JS engine from building a JIT.


JS is single-threaded which makes an enormous difference to actually squeezing performance out of your JIT.

Just building a JIT for a language generally isn't the hard part. Building a JIT that is substantially faster than a bytecode-compiled implementation of the language is what's hard, and how hard that is depends intimately on the semantics of the source language. When I say intimately, I mean every single detail of the language's semantics matter.


This article is a follow up to an earlier post (https://blog.pyston.org/2016/06/30/baseline-jit-and-inline-c...) which provides additional context. In short, python opcodes are inefficient because they must support a significant amount of dynamicity, even as opposed to some other language peers like JS, PHP, Lua, etc. The original post explains some of the cool stuff they're doing with Pyston (specifically baseline jit and inline caching) to combat it.


This is also why tensor flow is such a resource hog compared to torch. My mac book pro has intel iris graphics so no cuda, but I was able to get usable results with torch, still struggling to get tensor flow to produce anything useful. Compared to lua python is very hungry, but also Apple is going to need to have a decent GPU option in its new macbook pros to keep people who work on ai projects from bailing for hackingtoshes.


I don't know a ton about low level language implementation, so excuse me if this comment is misguided, but...

Does type hinting in Python 3.x (PEP 484) change this (i.e., reduce the overhead due to dynamicness)? I know it doesn't include runtime type checking, so perhaps it's moot but maybe someone has written an interpreter that does hard type checking either at runtime or through a compilation step.


> Does type hinting in Python 3.x (PEP 484) change this

No. The type hints are only meant for tooling (type checking, refactoring, documentation, etc.).


> Java used to have the opposite problem. It was doing too much at the "Java bytecode" level, such as string manipulation - so they added more "complex" opcodes written in C/C++ to speed things up, significantly.

Where'd you get that idea? It's because of advanced JITs that Java got a lot faster.

- former JVM hacker


This. The article does not explain at all where exactly all the processor cycles are going. It's basically just saying "it's not the Python languages' fault!" but fails to name a specific culprit.

It says it's spending the cycles in the "C Runtime" but what exactly does it (have to) do in the C Runtime that eats up performance?


> It's basically just saying "it's not the Python languages' fault!"

The article is actually saying the exact opposite. It claims Python the Language is slow because the opcodes need to do a lot of work according to the language specification. Python is not slow because the core team has done a poor job implementing the opcode interpreter and runtime.

When you have a language with thin opcodes that map closely to processor instructions then compiler improvements lead to smarter opcode generation which translates to efficient machine code after jitting. When you have fat opcodes you're SOL.

Consider this: an instruction like STORE_ATTR (implements obj.name = value) has to check whether the top of the stack refers to an object, then whether writing to the object attributes is allowed. Then "name" has to be checked if it's a string. Perhaps additional checking is needed to normalize the string or other unicode testing. Then the assignment has to happen which is a dictionary update (internally a hash map insertion which may trigger a resize). This is just the tip of the iceberg. A lot more stuff is happening and the correct exceptions are thrown when the instruction is used incorrectly (which leads to code bloat which hurts the instruction cache).

A thin bytecode instruction for STORE_ATTR could actually reduce the store to a single LEA machine code instruction (Load Effective Address).

The downside of a language with a thin instruction set is that the individual instructions can't validate their input. They have to trust that the compiler did its job correctly, a segfault or memory corruption will happen otherwise. One of Guido's goals when creating Python was that the runtime should never ever crash, even on nonsensical input. This pretty much rules out a thin instruction set right from the start. Simple concurrency is also a lot easier with a fat instruction set, because the interpreter can just yield in between instructions (Global interpreter lock). With a thin instruction set there is no clear delineation between instructions where all memory is guaranteed to be in a stable state. So a different locking model is needed for multi-threading, which adds even more complexity to the compiler and runtime.


All the problems you're describing are solved with a powerful JIT. And the core team do seem to be opposed to doing the work needed for that.


Python's philosophy chooses simplicity over everything else. Simple grammar. Simple bytecode instruction set. Simple CPython implementation. Simple threading model (GIL). Simple data structures. Adding a highly sophisticated and complicated JIT on top of that makes little sense.

It's not so difficult to create a high performance language that's much like Python. It's just not possible to make Python fast without abandoning some of its founding principles.


Why is a simple CPython implementation such an important requirement?

Portability? Make the JIT optional.

Ease of maintenance? Get a small team of experts to maintain it on behalf of everyone else.

Openness to beginners? That would be nice if possible as well, but CPython's job is to run programs rather than to educate.

A JIT needn't make the grammar, bytecode or threading model more complex. It would make data structures and the implementation more complex, but do you not think that's worth it if Python could be twice as fast?


> CPython's job is to run programs rather than to educate.

CPythons 'job' is to be the reference implementation of Python.


But that's just not the case in reality is it? In reality it's the main production implementation and its inefficiency costs the world wasted resources every day.

If readability and being the reference implementation is more important than performance, why is Python implemented in C rather than a higher level language?


> In reality it's the main production implementation and its inefficiency costs the world wasted resources every day.

Sure, but the inefficiencies in every part of the stack from the physical CPU right up to the executing program also cause waste.

> If readability and being the reference implementation is more important than performance, why is Python implemented in C rather than a higher level language?

Because like most projects it grew organically. Guido didn't sit down and write the first version of Python and think "hey, this is going to be the reference implementation for Python so lets write it in pure pseudocode so it's easy to read", he bashed out a version in C and it gained momentum over time. At the point where it became the reference implementation rather than the only implementation it would be suicide to chuck it out and re-write it in some high level language.


To be fair, the GIL wasn't included because it was a simple threading model (AFAIK). It was included because it was simple to implement and it was/is fast(er) (than removing it)[1][2].

If the Gilectomy [2] project succeeds, Guido has mentioned he would consider it for Python3.6+ [3].

[1] http://www.artima.com/weblogs/viewpost.jsp?thread=214235

[2] https://www.youtube.com/watch?v=P3AyI_u66Bw

[3] https://youtu.be/YgtL4S7Hrwo?t=10m59s


Hence why I rather support Julia and leave Python for shell scripting like tasks.


coughsufficiently smart compilercough.


No we have compilers that can do these things today.


A small nit: LEA does the calculation but doesn't read or write from that address. In times of old, this instruction used the memory addressing port to do the calculation, but these days it's just a normal arithmetic instruction with slight difference that it doesn't set the flags based on the result. Instead, the ideal would be for the bytecode to reduce to a single MOV. In addition to loading and storing, MOV itself supports several forms of indexed and indirect address calculation which execute on dedicated ports without adding latency.


Which complex bytecodes did Java introduce?

Since hotspot was introduced,the amount of C++ on the reference JDK has been incrementally reduced between releases,with Hotspot improving its heuristics.

To the point that Graal is a pure Java JIT.

Also in the 80's and early 90's C compilers generated awfully slow code.

C developers have to thank almost 40 years of research in C optimizers and misuse of UB optimizations for the current state of C compilers quality in code generation.


The author means the Hotspot JVM has added intrinsics for memset, autovectorization, etc.

I don't agree with the main point of the article though. Pypy, ZipPy, etc. have shown that there are real gains from running the actual Python code faster.


>>I know Perl had many efficiencies built in, and was considered quite fast at some point (90s?).

There are a lot of threads in Perlmonks that talk in detail about speeding up Perl, related project et al.

To be summarizing it. Languages like Perl and Python are slow because they do a lot of work out of the box that languages like C don't. There fore when you talk of talk of translating Python to C, or Perl to C. Essentially what you are talking of is translating all that extra action back into C, which will run as fast as Perl or Python itself.

The more you make it easy for the compiler to interpret the faster it can run and vice versa.

Python is slow for the very reason its famous, its easy for the programmer.


Lisp and Smalltalk like languages run circles around Python and Perl performance.

Enjoy the same powerful features, have JIT and AOT compilers to native code.

It all boils down to how much the language designers care about performance.


And also how much the language designers care about proper language design.


I'm not sure if this really answers the question though. There are plenty of languages that do more work than C does that are not as slow as python. Do they do less work than python? Maybe, but they certainly do more work than C. The question is, even if they do less work than python, is the extra work python doing valuable to you?


Lua is almost as dynamic and flexible as Python and is very easy for the programmer. LuaJIT's performance is close to that of C.


I think part of the reason for this, though, is that Lua is a very "thin" language. It purposefully doesn't have a lot of fancy features (I mean, come on, it literally has one composite datatype, and you're supposed to use them [tables] for everything from arrays to maps to full objects). By removing a lot of assumptions that make Python "easy", Lua has made it much easier to write optimizing JITers and compilers since runtime behavior is (perhaps a little paradoxically) easier to predict. And it doesn't make Lua any less "easy" a language, it's just a different set of rules to follow. Guido's hunt for the simplest programming language possible is noble, but sometimes it causes poor decisions to be made imho.


Well, just want to say for one example here, why Python is hard to speedup than JS.

In Python, essentially every object is a dictionary. And behavior like, "x.y = 'z'" is legal, meaning pretty much everything could be put into object x without declaration ahead of time, at all. While, not very sure, JS doesn't allow such behavior, you can still do assignment, but assessing will yield undefined.

The above behavior come as default(you can use __slot__ for sure, but compiler can't assume that) is pretty problematic for optimization, compiler simply won't know where to locate a specific variable ahead-of-time, so it cannot substitute the variable access using a pointer offset, it will have to go through a hash lookup, which comparing to the offset alternative, is magnitude slower.


Except that LuaJIT does something close to just a pointer offset. The trick it uses is to use a hash function that is known by the compiler so when you have a constant key ("y" in this case) the hash table set is compiled down to 3 instructions (imagine hash("y") = 0xa7):

1. Load whatever necessary to check if x[0xa7] is occupied 2. Branch to slow path if occupied 3. Store x[0xa7] = 'z'

And a modern super-scalar CPU will even do some of this in parallel.


You can generally do that in JS as well. (Added properties are sometimes called "expandos".)


True. Just find out. But Chrome's console yield undefined to me...Weird.

Edit: find the reason. It is because I am trying to assign a property to int. Seems JS won't allow this to primitive type?

Edit: Python won't allow assignment to primitive type bject either. The handling is the same.


Are you the OP author, or working on Pyston? I have basically two questions/curiosities — I'm not asking adversarially: 1) For which code is the C runtime most expensive? Typical Python code tries to leave heavy-lifting in libraries, but what if you write your inner loop in Python? Enabling that is (arguably) one goal of JIT compilation, so that you don't need to write code in C. 2) What about using Python ports of performance-sensitive libraries?

In more detail: I arrived at https://lwn.net/Articles/691243/, but I'm not sure I'm convinced. Or rather: with a JIT compiler you probably want to rewrite (parts of) C runtime code into Python so you can JIT it with the rest (PyPy has already replaced C code in their implementation, so maybe there's work to reuse). For instance, an optimizing compiler should ideally remove abstractions from here:

  import itertools
  sum(itertools.repeat(1.0, 100000000))
Optimizing that code is not so easy, especially if that involves inlining C code (I wouldn't try, if possible), but an easier step is to optimize the same code written as a plain while loop. Does Pyston achieve that? I guess the question applies to the LLVM-based tier, not otherwise.

Yes, Python semantics allow for lots of introspection, and that's expensive — but so did Smalltalk to a large extent. Yet people managed, for instance, to not allocate stack frames on the heap unless needed (I'm pointing vaguely in the direction of JIT compilers for Smalltalk and Self, though by now I forgot those details).


> Optimizing that code is not so easy, especially if that involves inlining C code (I wouldn't try, if possible)

Inlining through C code probably isn't really an option[0], but the optimisation itself shouldn't be that much of an issue, the rust version and the equivalent imperative loop compile to the exact same code: https://godbolt.org/g/OJHIwc[1]

[0] unless you interpret — and can JIT — the C code with the same underlying machinery as Truffle does

[1] used iter_arith for sum(), but you can replace sum() by an explicit fold for no difference: https://godbolt.org/g/R1BgQQ


> [0] unless you interpret — and can JIT — the C code with the same underlying machinery as Truffle does

Pyston can easily JIT the C code because it uses LLVM for it's main JIT tier.


Nope, I have nothing to do with the blog or Pyston. Sorry if I gave that impression.


You may be interested in an optimizing Python compiler called Pythran.[1][2]

[1] - https://github.com/serge-sans-paille/pythran

[2] - https://www.youtube.com/watch?v=Af8B30mXZ7E


I'm not sure your assumptions support your conclusions here, especially your believe that a JIT compiler wouldn't make inroads on the slow C code being executed.

The biggest problem of Python is that it lacks the experts that could write those fast runtimes, and it fails to attract them after the Python leaders declared the GIL to be a non-issue.


There's PyPy. I think the problem with python is the community is especially resistant to breaking changes.


There's lot of python people who were very happily breaking the backwards compability in 3.0. However I guess majority of the comunity was not in that boat, but pretty big portion anyway.


So Python runs in very low memory because of no JIT. You can run say 10 copies of your Python Web server in the space of 1 Java runtime. Will the Java runtime still win? Sure the benchmarks prove it. But there are advantages to running in low memory. Really cheap hosting for low traffic stuff comes to mind... java on a cheap host tends to be disaster.


Not because of JIT, but because of refcounting GC. Also the JVM probably loads (and does, or at least is prepared to do - http://www.azulsystems.com/blog/wp-content/uploads/2011/03/2... ) too much crap. (And probably the jigsaw and modular and whatever OpenJDK projects will help with that.)

Furthermore, Python is not really memory prudent either. PHP is much better in that regard (very fast startup time, fast page render times, but a rather different approach).


If you're running a low traffic site the hosting is a rounding error. We're talking about $5-10 here. Is that really a justification for anything? If money is THAT important you just write the app with Nginx/SQLite/Lua in the first place.


The BDFL (Benevolent Dictator For Life) Guido Von Rossum himself put forth the idea that he would consider a patch removing the GIL in 2007 [1]:

> "... I'd welcome a set of patches into Py3k only if the performance for a single-threaded program (and for a multi-threaded but I/O-bound program) does not decrease."

Unfortunately, experiments thus far have not succeeded to meet these requirements.

There is some work being done by the Gilectomy project to try and meet this bar as well as some other requirements currently though [2]. But it is currently grappling with the afore-discovered performance issues that come with removing the GIL.

Also at PyCon 2016, Guido himself mentions the Gilectomy project and it's potential consideration (if it works) for Python 3.6+ [3].

So when you say Python leaders declared the GIL a "non-issue", I think you are oversimplifying the actual reality of what removing the GIL means and why leaders (like Guido) have been reluctant to invest resources pursuing.

[1] http://www.artima.com/weblogs/viewpost.jsp?thread=214235

[2] https://www.youtube.com/watch?v=P3AyI_u66Bw

[3] https://youtu.be/YgtL4S7Hrwo?t=10m59s


That's a silly explanation. A smart JIT would be able to take successive Python opcodes and optimise them together as one operation, so how much time it spends interpreting individual instructions has nothing to do with the "speed", in the sense of "current speed" vs "potential speed".

If you want to talk about speed in any meaningful sense, you have to talk about potential for optimisation. It's possible that Python (and its opcodes) are designed in such a way that there is little potential for optimisation. This has to do with the semantics of the language, not how much time it spends in one part of the code or the other.

I don't know ultimately how much potential for optimisation Python has, but clearly it's a very difficult problem, so we can say with some certainty that Python is "slow", in the concrete sense that there are no low-hanging fruit left for speeding it up.

Edit: I say this as an avid Python user by the way. Especially combined with ctypes, I find the Python interpreter to be an absolutely excellent way to "organize" a bunch of faster code written in C/C++. I actually don't have any problem with Python itself being slow, I kind of like it that way personally. It's easy to understand its execution model if the interpreter is kept fairly simple, and this makes it easy to reason with. But then again I am not writing web-scale backends with it, I am just, more or less, using it to batch calls to scientific C functions. So it really depends on your use case. While I've spent plenty of time tuning every little performance gain out of a tight computational loop in C, I can't think of a single time where I've struggled to figure out how to speed up my Python code -- I just am not using it in ways that that would be necessary.


> That's a silly explanation. A smart JIT would be able to take successive Python opcodes and optimise them together as one operation, so how much time it spends interpreting individual instructions has nothing to do with the "speed", in the sense of "current speed" vs "potential speed".

I think you might actually be agreeing with the explanation. The point about big opcodes means that the opportunities to look at a sequence of opcodes (i.e. the Python part) are reduced because you're doing a lot of computation over a relatively small number of opcodes. So the challenge involves optimizing the "guts" of the opcodes and the sequence of "guys" across relatively few opcodes. Their approach to solving this is discussed in the original blog post (https://blog.pyston.org/2016/06/30/baseline-jit-and-inline-c...). This complication happens to make optimizing Python via JIT compilation a tough problem.


> When people say "why doesn't Pyston use [insert favorite JIT technique here]", my question is whether that technique would help speed up C code.

That's a silly question. JITs have knowledge instructions relate to each other. The "C code" you're talking about is not opaque to them, it has meanings that can be optimized in relation to other instructions. When you have a "* 2" bytecode + argument it doesn't just dispatch to a C function that multiplies the input by 2. The compiler knows the semantics of that and can convert it to a shift if appropriate.

It's not a JIT's responsibility to "speed up the C code". The C code is part of the interpreter, JITs (generally) don't invoke interpreter functions.

Or put differently, if you're spending too much time in the C runtime then maybe more code needs to be ported into JITed language itself so that code can also benefit from JIT optimizations.


Better-formatted link:

https://www.pastery.net/chjxpt/


I think this post misses the point. Having to enter the runtime in the first place is the problem (and making runtime code marginally faster is not the solution). Fast VMs for dynamically typed languages (Chakra, V8) design the object model and the runtime around being able to spend as much time in JITted code as possible - the "fast path".


Also, Python has a global interpreter lock so it has no parallelism.


See Larry Hastings' talk "Removing Python's GIL: The Gilectomy" from PyCon 2016.

https://www.youtube.com/watch?v=P3AyI_u66Bw

As well as his earlier talk "Python's Infamous GIL".

https://www.youtube.com/watch?v=4zeHStBowEk

The GIL is a simple design that provides serious benefit to Python (end user) developers.

There are plenty of options for working around this for parallelism, such as the built-in module multiprocessing.


I liked the talk, but when he talks about the GIL keeping Python "fast" I have to laugh. It's so slow already, who cares? He is talking about a factor of 1.3 to 2X to remove the GIL in a language that is already 30-100X slower than C/C++.


Not none, poor...


Some experiments I did during my coursework for my M.Sc. indicated that it was actually worse-than-useless at the time (2009?) Here's what would happen:

- When you've got one CPU core, the threads basically just act like a multiplexer. One thread runs for a while, releases the GIL, and the next thread runs for a while. Not a big deal.

- When you've got multiple CPU cores, you've got a thundering herd. When the lock is released, the threads waiting on all of the other cores all try to acquire the lock at the same time. Then after one thread has run on, say, core 3, it's gone and invalidated the cache on the other cores (mark & sweep hurts caches pretty badly). The thundering herd stampedes again and the process continues.

- To make matters even worse, each core runs at low utilization (e.g. a quad core machine, each core runs at ~25%). If you've got CPU throttling turned on (which my laptop, where I started the experiments, did), then the system detects that the CPU load is low and scales down the clock speed. Normally, this would result in increased CPU utilization, which would speed the CPUs back up again. Unfortunately, the per-core utilization stays pegged at 25% and things never speed back up again. The system looks at it and says "huh! only 25%! I guess we've got the CPU speed set properly!"

Maybe it's gotten better since then? I haven't checked recently.

Edit: I wish I had the results handy. The basic conclusion was that you got something like a 1.5x slowdown per additional CPU core. That's not how it's supposed to work! Using taskset to limit a multi-threaded Python process to a single core resulted in significant speedups in the use cases I tried.


This is a known issue in py2. On py2 when running in a multi-core machine it'll run ~1.8x slower (depending on what you are doing) than it'll run in a single-core machine. Python 3.2 ships a new GIL[0] fixing the problem.

[0] http://www.dabeaz.com/python/NewGIL.pdf


Dave Beazley! Yes, it was some of his work that inspired my research. Thanks for the reminder!

Edit: That's a beautiful solution to the problem, too. You're still not going to get a performance boost from multiple cores, but you're not going to have it fall flat on its face either.


Sounds interresting, I'd be glad to see a blog post with the actual use cases and a few graphs with different numbers of cores, and maybe the sources so people can go further.


I'll try to dig it up. I suspect it's sitting in an SVN repo somewhere...

If I recall, I took a stock Python interpreter and instrumented it with RDTSC instructions to do lightweight timestamps on GIL acquisitions and releases.


I don't think the GIL is ever released while running Python code is it? So there is no parallelism between Python threads.


You're forgetting multiprocessing. That manages to get round this by running multiple Python interpreters. Big problem is passing objects is dog slow as everything has to be pickled either way.


No true multi-threading, but multi-processing is not affected by the GIL.

Also, outside of parallelism, Python has good concurrency support with Gevent and now AsyncIO in Python 3.


Sorry for off topic, but I just tried getting the Google cache as well and it just doesn't work. Both in Firefox and Chromium, when I type "cache:<Ctrl+V><Enter>" in the google.com search bar, nothing happens. I checked the developer console, no network requests are made. It says in grey below the search box "press enter to search", but neither enter nor clicking the blue search icon does anything whatsoever.

In the past one could manually type /search?q=cache:something in the address bar and it would force the search, but these days it doesn't seem to work anymore. The search request is done in javascript and the /search?q=x link just prefills the search box, leaving it to javascript to fire the actual query (which then fails to do so).

Edit: found a way: disable Javascript. This forces Google to search immediately. No plugin necessary: in the Firefox developer console, use the cog wheel on the right top, then in the right column somewhere near the center you can tick "Disable Javascript". Loading the Google home- and search page is also noticeably faster, by the way.


I think it's true that language implementations such as Ruby and Python spend most of their time running the C parts of the code. I did a talk saying the same thing about Ruby a couple of weeks ago, but referring to the Java code in JRuby, https://ia601503.us.archive.org/32/items/vmss16/seaton.pdf.

But this doesn't mean that a JIT is not going to help you. It means that you need a more powerful JIT which can optimise through this C code. That may mean that you need you to rewrite the C in a managed language such as Java or RPython which you can optimise through (which we know works), or maybe we could include the LLVM IR of the C runtime and make that accessible to the JIT at runtime (which is a good idea, but we don't know if it's practical).

I work on an implementation of Ruby, and we make available the IR of all our runtime routines (in our case implemented in Java) to a powerful JIT, so that we can inline from the interpreter into the runtime and back again.

In the case of Python, PyPy does the same thing, allowing the JIT to optimise between the interpreter and runtime, as they're both written in RPython.

So I think the problem the Pyston project needs to solve is how to allow the JIT to see the runtime routines and optimise through them like it does with Python code.


Pypy makes your app take many times the memory for like 20% perf. Which is good but seems maybe often not worth the effort.


Eh...

  cat<<eof > float.py
  import itertools
  s = sum(itertools.repeat(1.0, 100000000))
  print(s)

  $ time python float.py 
  100000000.0

  real    0m0.602s
  user    0m0.596s
  sys     0m0.004s

  time python3 float.py 
  100000000.0

  real    0m0.603s
  user    0m0.600s
  sys     0m0.000s

  $ time pypy float.py 
  100000000.0

  real    0m0.211s
  user    0m0.088s
  sys     0m0.004s
That's with no warmup for the pypy variant (or indeed the other python variants). Or, slightly more "robust":

   $ python -m timeit -s "import itertools as i" \
                 "sum(i.repeat(1.0, 100000000))"
  10 loops, best of 3: 594 msec per loop

  $ python3 -m timeit -s "import itertools as i" \
                 "sum(i.repeat(1.0, 100000000))"
  10 loops, best of 3: 592 msec per loop

  $ pypy -m timeit -s "import itertools as i" \
              "sum(i.repeat(1.0, 100000000))"
  10 loops, best of 3: 68.2 msec per loop
Pypy actually does pretty good here:

  $ cat float.cpp 
  #include<iostream>

  int main() {
    double s = 0;
    for (int i = 0; i < 100000000; ++i) {
        s++;
    }

    std::cout << s << std::endl;
    return 0;
  }

  $ g++ --std=c++14 -O3 float.cpp
  $ time ./float
  1e+08

  real    0m0.237s
  user    0m0.236s
  sys     0m0.000s
Note that the C++ code use a loop, not a lazy generator. Apparently they may be coming in c++17 as proposal N4286.


Summing a list of numbers is easy mode for a JIT. You've got a tight loop with one type that can be statically shown will never be violated in real-time. Unfortunately, unless that's actually your workload, the speed with with a JIT-based system can add numbers is not relevant to how fast it runs in practice. Any JIT that can't tie C on that workload is broken somehow.

Personally, I think people often go quite overboard with the "benchmarks are useless" idea, but this benchmark really is useless, because it will never produce any differences betweens JITs and thus can't show whether one is good or bad.


> it will never produce any differences betweens JITs and thus can't show whether one is good or bad

It can tell you which JITs can't even manage to remove the loop, which is useful to know.


Apparently neither cpython, pypy or gcc manage to remove the loop in this case. I actually think it is interesting that this "slow" code in cpython is within [ed: ~10x] of pypy/jit/machine code (c++ probably should do better, I'm not all that familiar with gcc - maybe -O3 isn't enough to try to unroll loops and/or try to vectorize).

Actually code like this arguably should be a win for a high-level language with an optimization pass; ideally the whole thing should be translated to a constant at compile-time.


Ah right I think that's because the accumulator is a double. I missed that. I think it should still be possible but compilers probably don't bother.


That I don't necessarily expect from a JIT in real time. I'd expect it from any half-decent optimizing compiler, but I'd expect it to likely be the result of several interacting and too-expensive-for-real-time optimizations.

I mean, if the JIT works that out, great, and if someone wants to show off performance numbers that shows one can do that I'm interested in the information, but I wouldn't in general discard one for failing to notice that optimization.


I have a feeling 100% of the c++ time is being spent in some silliness like setting up the locale of the ostream, because my compiler totally eliminates that loop.


Probably. I glanced at the asm to make sure the loop was still there (which it was, possibly because it loops over an int, and sums doubles?) and couldn't see anything that stood out. Still a little surprised that pypy without warmup is faster than c++ for this silly thing.

[ed: On this system, eliminating the loop by hand looks like:

  #include<iostream>
  int main() {
    double s = 100000000;

    std::cout << s << std::endl;
    return 0;
  }

  $ g++ --std=c++14 -O3 float2.cpp -o float2 \
    && time ./float2
  1e+08

  real    0m0.001s
  user    0m0.000s
  sys     0m0.000s
Just for completeness.]


Real life coding is not a loop.

Try that on a Websever or an image processing box.


On Pypy's benchmark site, the speedup is a lot higher than 20%. I usually experience a 2x-3x speedup with the kind of code I run, sometimes more.

> http://speed.pypy.org/


I never quite grasped the actual machinations of Python until I watched Philip Guo's lectures on Python internals.

https://www.youtube.com/watch?v=LhadeL7_EIU

It's a bit long, and definitely over several sittings, but I feel like I really understand Python better and relevant to the post, the complexities and tracking (frames, exceptions, objects, types, stacks, references) that occur behind the curtain which drive Python's slow native performance.


Is the slowness due to the structure of the language, or is it because of the implementation?

I guess the latter, because PyPy performs a lot better I hear.


PyPy performs better, but when you perform 2-3x times better than something ~40x slower than C, you still don't end up with a "fast implementation". Just, "not as slow". If you've got Python code in hand and you want it to go faster, PyPy can have a great bang-for-the-buck, but if you want it to be legitimately approaching the limits of the capabilities of the hardware, you'll need a different approach.

But let me once again underline that if you have Python code in hand, and you want it to be faster, PyPy is a great option. I'm not being critical of PyPy.

A common mantra I've heard dozens of times in the last ~20 years is that there's no such thing as a slow language, only slow implementations. But after witnessing the effort to create "fast" implementations for a lot of slow languages over the past 10 years, and seeing so many of them plateau out at about 10x slower than C, I no longer believe this. Or at least, I no longer believe it is practically true. If there is an implementation of Python somewhere in theoretical program space that is as fast as C, it does not appear to me that it will be possible for humans to produce it.


I agree. Though speed of the JVM for instance is not quite as bad. C comes at a development cost, Rust makes this better, but the memory management is still something that you have to get comfortable with.

The question that really nags at me is why do people want interpreted languages in all of these cases? When you're deploying code, you inevitably go through a series of steps in deployment where throwing in a compile wouldn't destroy the workflow.

I think for many of these cases, the GIL is a great example of this, the language has over-optimized for development at the cost of its runtime.


I'm guessing that people don't usually really want an interpreted language; usually what they want is a language they like, and the one they like happens to be interpreted.

I can imagine reasons to want an interpreted language. As a matter of fact, I've written several implementations over the past 20 years of a hobby language, some of them compiled, some of them interpreted, and in some of the later cases I consciously chose interpretation because dynamic runtime introspection and the ability to see (expanded) source code directly in the runtime during a breakloop was something I wanted, because I was experimenting with runtime semantics and I wanted to be able to see it directly at runtime with the minimum possible change from the source code as-written.

I've also written interpreters sometimes because I'm actually interested in interpreters per se.

But most people probably don't want interpreters for those kinds of reasons. As I said, I think it's more likely that most of the time when someone wants "an interpreted language", what they really wanted is some particular language whose most prominent implementation happens to be interpreted.

That raises the question of why implementations are interpreted, of course. The answers, I think, are some combination of the answers I gave above and the fact that interpreters are really easy to write, especially if you choose the right source language. Simple compilers are not much harder, but easier is easier. I'm generally inclined to start with an easy interpreter, myself, (unless what I'm interested in is compilation strategies) because I get from zero to experimenting with semantics that much quicker, and experimenting with semantics is usually where the fun is.


To be fair, the GIL was included (AFAIK) because it was simple to implement. Also, Guido has said (2007) [1] that he would welcome a patch to remove the GIL if...:

> "... I'd welcome a set of patches into Py3k only if the performance for a single-threaded program (and for a multi-threaded but I/O-bound program) does not decrease."

Unfortunately, experiments thus far have not succeeded to meet these requirements.

There is some work being done by the Gilectomy project to try and meet this bar as well as some other requirements currently though [2]. But it is currently grappling with the afore-discovered performance issues that come with removing the GIL.

Also at PyCon 2016, Guido himself mentions the Gilectomy project and it's potential consideration (if it works) for Python 3.6+ [3].

[1] http://www.artima.com/weblogs/viewpost.jsp?thread=214235

[2] https://www.youtube.com/watch?v=P3AyI_u66Bw

[3] https://youtu.be/YgtL4S7Hrwo?t=10m59s


> no such thing as slow language, only slow implementations

My interpretation of that (and perhaps I'm reading too much into it) was that for modern-day 'retail' tasks (taking an HTTP request, querying a database, dispatching workers, running business logic, and returning a styled webpage) that going for low-level languages over high-level languages did not bear out as much improvement, esp. reconciled against development time and code flexibility.


No, I'm fairly sure this saying is used in the context of the implementations actually being slow. "You're always in IO" is a different defense used for the slow languages, one which I have also found is overstated, for what it's worth. I have found in practice it isn't that hard to build even a simple-looking REST interface that has non-trivial amounts of time spent in non-IO code in Perl or Python. I think a lot of people would be shocked if they sat down and really worked out just how little Python/Perl/etc. code they can run before they webserver noticeably starts slowing down at even small scales.


Agreed.

I spent a little time working with Guido on cache design for the App Engine data access api (ndb). In Java-land, it's a big win to cache datastore entities in memcache because fetch latency is about an order of magnitude faster. In Python-land, caching is a mixed bag - real-world profiling found marginal benefit even with 100% hit rate, and a significant performance disadvantage if the hit rate drops off. This is primarily attributed to pickling overhead.

If you're curious, here are the benchmarks (from 2011): https://github.com/googlecloudplatform/datastore-ndb-python/...


I'm rethinking my interpretation on this.


These are typical web app based tasks and IO is usually the bottleneck in such situations, not the language.


That's sort of my point, if I understand you correctly.

There are two responses to "Hey, the site is slow!" One is "Well, that's because PHP/Python/Java is slow, we should start coding in C/etc". Another is "The codebase or architecture is probably unoptimized and/or poorly structured in some areas; let's identify, prioritize and start fixing those areas."

The slow languages are "good enough", in an engineering sense, for today's retail features. The payoff of using a fast language doesn't provide enough value for its cost. It doesn't matter that searching a PHP array is slow; that's not where the problems are in today's software.


Poor article. This subject has been covered many times, and others have put in the key references.

Python's dynamism is part of the problem, of course. Too much time is spent looking up objects in dictionaries. That's well known, and there are optimizations for that.

Python's concurrency approach is worse. Any thread can modify any object in any other thread at any time. That prevents a wide range of optimizations. This is a design flaw of Python. Few programs actually go mucking with stuff in other threads; in programs that work, inter-thread communication is quite limited. But the language neither knows that nor provides tools to help with it.

This is a legacy from the old C approach to concurrency - concurrency is an OS issue, not a language issue. C++ finally dealt with that a little, but it took decades. Go deals with it a little more, but didn't quite get it right; Go still has race conditions. Rust takes it seriously and finally seems to be getting it right. Python is still at the C level of concurrency understanding.

Except, of course, that Python has the Global Interpreter Lock. Attempts to eliminate it result in lots of smaller locks, but not much performance increase. Since the language doesn't know what's shared, lots of locking is needed to prevent the primitives from breaking.

It's the combination of those two problems that's the killer. Javascript has almost as much dynamism, but without extreme concurrency, the compiler can look at a block of code and often decide "the type of this can never change".


Note that threading is not the only place this can happen: this can happen as the result of signal handlers as well. cpython has can trigger signal handlers between any python opcodes.


(based on discussion, can't get to website)

Any discussion that does not compare to LuaJIT2 is suspect in its conclusions. On the surface, Lua is almost as dynamic as Python, and LuaJIT2 is able to do wonders with it.

Part of the problem with Python is (that Lua doesn't share) is that things you use all the time can potentially shift between two loop iterations, e.g.

    for x in os.listdir(PATH):
       y = os.path.join(PATH, x)
           process_file(y)
There is no guarantee that "os.path" (and thus "os.path.join") called by one iteration is the same one called by the next iteration - process_file() might have modified it.

It used to be common practice to cache useful routines (e.g. start with "os_path_join = os.path.join" before the loop and call "os_path_join" instead of "os.path.join"), thus avoiding the iterative lookup on each iteration. I'm not sure why it isn't anymore - it would also likely help PyPy and Pyston produce better code.

This is by no means the only thing that makes Python slower - my point is that if one looks for inherently slow issues that cannot be solved by an implementation, comparison with Lua/LuaJIT2 is a good way to go.


It used to be common practice to cache useful routines (e.g. start with "os_path_join = os.path.join" before the loop and call "os_path_join" instead of "os.path.join"), thus avoiding the iterative lookup on each iteration.

I'll admit to being skeptical to how big an effect this would have, but it was bigger than I thought:

  %%timeit
  s=0
  for i in itertools.repeat(1.0, 1000000):
      s+=math.sin(i)

  10 loops, best of 3: 124 ms per loop
vs

  %%timeit
  s=0
  math_sin=math.sin
  for i in itertools.repeat(1.0, 1000000):
      s+=math_sin(i)

  10 loops, best of 3: 89.1 ms per loop


Not sure how this timeit is running, but there are two things going on: Local name lookups are done as array indexing; whereas global lookups are done as cascading dictionary lookups. And the attribute lookup also is a dictionary search. On instances, looking up a method defined in a super class involves failed lookups in the instance and all the base classes in a definition. In a hot inner loop, definitely worth cutting out


But why does the programmer need to do this? Should the language interpreter be able to implicitly perform this operation?

Isn't that what the .pyc files are for, so that it doesn't need to perform lookups like this at runtime?


There is a difference in semantics. Someone else might actually want os.path to change between call. The only problem is that the behavior of the idiomatic version is hard (almost impossible) to optimize, even though the more optimizable semantics are actually what most users want.

The language was designed with expressiveness in mind, and it often comes at the expense of speed. Lua and Nim seem to strike a much better balance, and even JavaScript if you avoid the performance killers like "with".


> Someone else might actually want os.path to change between call.

Who? Why?


The os.path is not a great example, but imagine a loop where inside the body you mutate some state of an instance and then directly access it. Compare the following with the os.path example:

  for animal in circus.animals:
      circus.next_free_clown.assign(animal)
      # circus.next_free_clown changes in every iteration


Monkeypatching for debugging, mocking, etc.

Most of those are rare, but feasible. Monkeypatching a global as a side effect even rarer, but I think I've done it at some point.


No, the .pyc files are just the source code translated to bytecode when you import the .py file (notice that running with "python somefile.py" doesn't generate a somefile.pyc, the .pyc is created when you import, from inside python or another script).

.pyc files are created for "faster importing", and they remove having to parse and "compile to bytecode" the .py files.


The exact issue is that it cannot. The reason is variables can change on the runtime, and often dependant on input. Say I can have : if blah: os.path.join = lambda ...

and then the pre-looked-up version of method are now wrong.


Except that Lua has the almost the same problem, and Mike Pall solved it in LuaJIT. So did the Truffle/Graal team with JRuby+Truffle.

Lua has the same property that global methods are hash table lookups. What LuaJIT does is that it specializes hash table lookups to the key so that looking up an item in a hash table is two instructions which can be executed at the same time on a super-scalar CPU.

LuaJIT also gets serious wins from lifting the lookups out of loops since they can't change. This is only due to the fact that Lua doesn't have threads that can muck with each other's state. JRuby+Truffle solves this with multithreading by assuming that it doesn't change and deoptimizing into a slow path if it detects a change.


>> Part of the problem with Python is (that Lua doesn't share) is that things you use all the time can potentially shift between two loop iterations, e.g.

    for x in os.listdir(PATH):
        y = os.path.join(PATH, x)
        process_file(y)
>> There is no guarantee that "os.path" (and thus "os.path.join") called by one iteration is the same one called by the next iteration - process_file() might have modified it.

> LuaJIT also gets serious wins from lifting the lookups out of loops since they can't change. This is only due to the fact that Lua doesn't have threads that can muck with each other's state.

Would Lua code equivalent to the above Python have the same problem - that "process_file" could modify the value of "os.path"? Would this prevent LuaJIT from hoisting the lookup out of the loop, even in the absence of threads?


Idea: I wonder if it would be profitable to have a mode for Pyston/PyPy/ZiPy/Whatever that disables threads, or perhaps an annotation that stops other threads from being scheduled inside certain loops.

This would allow so many checks to be hoisted out of the loop like LuaJIT does, that otherwise can't be hoisted because in theory some thread could re-assign os.path in the middle of it. If this could provide a 2-5x speedup, in most scenarios it would outweigh the performance benefit of parallelism, as well as being easier than paralellizing.


There isn't even any guarantee that "os.path" is the vanilla iterative lookup it's presented as. For all we know, it could be a @property-decorated method thousands of lines long.


But as long as you know it is the same thing, you could move ("hoist") it outside the loop. In Lua, you generally can (and LuaJIT2 does). In Python, rarely if ever.


and that's when you dont do the optimization.


Part of the problem with Python is (that Lua doesn't share) is that things you use all the time can potentially shift between two loop iterations

Why is this problem not shared by Lua? Does Lua somehow prevent "process_file" from modifying the equivalent of "os.path"?


Due to aliasing problems, you have exactly this issue in C and C++ as well, yet those languages are both massively faster than the languages in question.


While true, an aliasing 'miss' in C just causes a pointer reload, while the same miss in Python causes a hash table lookup.


it causes much more than a hash table lookup: it causes an attribute lookup, which is ...

     1. an instance __dict__ lookup, 
     2. a __getattr__ call (itself involving another hash lookup)
     3. a class property lookup
     4. superclass property lookups (if there are any superclasses)
If it was just a hash table lookup, it would be LuaJIT2 fast. But it's more like 10 hash table lookups and indirections, and that's assuming you don't actually try to hook any of this machinery to do something.


A C compiler doesn't have to assume that the definitions of functions called directly can change. LLVM couldn't optimize calls to memcpy if it couldn't assume its semantics, just to name one example.


The following may also be of interest:

- Why Python is Slow: Looking Under the Hood [0]

- Fast Python, Slow Python by Alex Gaynor [1]

[0] https://jakevdp.github.io/blog/2014/05/09/why-python-is-slow...

[1] https://www.youtube.com/watch?v=7eeEf_rAJds


Python is high level glue between efficient C functions. It's blindingly fast if you are allowing those efficient C functions to do the heavy lifting. That's why you can do image processing in python, right up until the point where something you need to do doesn't have a ready made primitive and then your program will slow down tremendously if you don't take the time to write the thing you're missing in C.


As glue, it isn't particularly efficient, both in runtime or in programmer affordances. Ctypes and cffi are both fairly new and not very friendly. The number of programmers who "drop to native" is ridiculously small. A better glue language would make this almost transparent.


every python programmer who has every used Numpy or Pandas, and in my opinion this is where Python shines and why it's so huge in scientific programming, is "dropping into native". So actually a large amount of people are doing so. And I find Python to be an excellent glue language with almost anything I can think of being possible, and much of my heavy lifting being extremely efficient, especially if you use a modern AVX-enabled Numpy.

Arguably anybody who ever accessed a database in Python is also "dropping into native". That's why no sane database is written in Python, but plenty of database-using applications are.

The only language I have discovered that approaches the efficiency of Python as "glue" is R, but it's about 20x slower, and doesn't even try to be threaded (which can be a big problem for IO sensitive glue tasks).


Because extensions that use native code exist doesn't mean that low friction affordances exist for the median programmer to use native code in their applications. I think we will start to see some interesting projects in this space after Python 3.6 ships.


While Cython was made to "write C with Python", it also allows you to write wrappers to use C libraries from python, and quite easily too (compared to other alternatives).


1) you don't need to write c, to use it.

2) the number of devs who drop to "native" is ridiculously small because the problems that require that much effort to optimise is ridiculously small.


Not even just efficient C functions. I use PyCUDA all the time to go even faster! Really beats writing c-cuda by hand for me. Numba looks even better but I haven't tried it out myself yet.


Actually I never found Python "slow". I found it "slow" in some things, but not "all things are slow with python". What was problematic was pulling big lists from a database and doing some stuff with the list.

Also it was really akward that "threading" is not really great on python. Especially not when you are using 16 core servers. I mean you could create vm's for that or dockerize that. but that means deployment complexity increases which wasn't our goal. But I've seen a lot of successful python deployments and if you have enough manpower you pretty sure can run with CPython just fine.


What you mean is that Python is fast enough for your purposes, which is great for you but it's not what this post is about.

I think it's pretty non-controversial to say that basic language operations in Python are slower than in other language implementations, which is what this post is talking about.


On a 16 core server you can run 32 copies of your program Ala Celery. It will peg the CPUs just dandy..


My experience is that it is not "dandy". We have had a lot of trouble pegging the CPUs on our celery worker boxes (doing CPU bound jobs). You get more than one CPU utilised, sure, but we never can seem to get all cores fully utilised. We rewrote some of the tasks into a single multi-threaded JVM process pulling off the rabbit queue and they instantly and consistently pegged every CPU at 100%. I wish I knew how to get our celery worker farm to full utilisation because it would save us a fair bit of money.


He's saying run 32 individual processes, not 32 threads within one process. Python's global interpreter lock will knobble you if you're using threads.


Celery has a worker pool of separate Python processes that jobs can be offloaded to. It side steps the GIL because it doesn't use threading.


Well two ways. Either launch 32 or 64 copies of the process using multiprocessing, or 1000 threads with geventlet.

I have never found a box I couldn't peg :)

I guess if you had 1 gb ram and 16 cores it would be challenging in python.. but a few gb of ram and we are on.


We are using processes in celery, not threads. Yet we still can't get to full utilisation. On an eight CPU machine we get around 650% - 750% CPU utilisation. It varies, it goes up and down in that range. We never get to 800%. The JVM version is just constantly 800%.


Can someone ELI5 why Python is more like rendering HTML than executing JS? This is confusing to me since much of node.js/V8 is C, yet AFAIK (from the title and my experience) it's faster, and I don't recall anything intrinsically more declarative (i.e. HTML) when writing python compared to JS. They both feel very similar as scripting languages to me.

IOW, from my limited ignorant perspective, this feels more like the WHAT than the underlying differentiating WHY.

It's possible it's in there and I missed it.

EDIT: FWICT from the linked slides[1], it's the result of 2 issues: 1. Expensive dynamic language features and 2. python is like node.js, but as if you only called V8 bindings and so VM performance was irrelevant. This is strange to me; while I feel I can conceptualize the difference, I still don't know enough to understand why it is so compared with node.js.

[1] https://docs.google.com/presentation/d/10j9T4odf67mBSIJ7IDFA...


The HTML comparison was referring to the python bytecode rather than the language. A more apt comparison would be CISC vs RISC processors, where the V8 IR is more like a RISC processor.


The argument is that Python spend most time in the C runtime, but that is really only a property of the current implementation. If I use Jython or even implement Python on top of bare metal, making Python the OS, that will not be spending any time in the C runtime because there wouldn't be one, but that's besides the point.

The question is whether an implementation could be made significantly faster than the current Python interpreter is, or whether there are properties of the Python semantics that makes that hard. One such thing is the amount of dynamic behavior that is allowed in Python. That require most variables to be boxed, even for basic types like numbers. There are dynamic languages (LuaJIT was mentioned, but Javascript, Julia and several Lisp implementation could be mentioned as well) that are considerably faster than Python, so why isn't Python fast?

Personally I think that Python could be made quite a bit faster than it is, but such a new Python system would almost certainly be incompatible with a large number of the Python libraries that interface with native code. For most Python users, this availability of libraries is a major motivation for using Python in the first place, and a new fast implementation without such compatibility would be worthless for most users.


From the linked [lwn] article:

"Another example he gave demonstrates the slowness of the C runtime:

    import itertools
    sum(itertools.repeat(1.0, 100000000))
That will calculate the sum of 100 million 1.0s. But it is six times slower than the equivalent JavaScript loop. Float addition is fast, as is sum(), but the result is not.

Larry Hastings asked what it was that was slowing everything down. Modzelewski replied that it is the boxing of the numbers, which requires allocations for creating objects. Though an audience member did point out with a chuckle that you can increase the number of iterations and Python will still give the right answer, while JavaScript will not."

Reminded me about the excellent talk about Julia: "Julia: to Lisp or not to Lisp?" https://www.youtube.com/watch?v=dK3zRXhrFZY

One things he points out early is that both the C99 and the R6RS Scheme spec is 20% numerical. Because correct and (reasonably) fast numbers and arithmetic is actually pretty hard to get right on a computer - if you want to abstract away hardware "short-cuts" and allow for precise arithmetic by default.

It will be interesting to see how much type hints (eliminating some of the boxing/unboxing) will help python. And if it turns out to really be a good fit for the language -- everyone wants "free" performance, but transitioning to a (even partially) typed language is certainly not "free".

Another point with the above loop, is that ideally, even if it can't be optimized/memoized down to a constant - it really shouldn't have to be much slower than its C counterpart. Except for handling bignums in some way or other (perhaps on overflow only).

[lwn] https://lwn.net/Articles/691243/


> Another point with the above loop, is that ideally, even if it can't be optimized/memoized down to a constant - it really shouldn't have to be much slower than its C counterpart. Except for handling bignums in some way or other (perhaps on overflow only).

Sadly, bignum handling is very expensive even with type hinting. Essentially, every time two numbers is added or subtracted you have to check for overflow:

    int z = hw_add(x, y);
    if (over/under-flowed?()) {
      bignum* bn = new bignum();
      bignum_add(to_bignum(x), to_bignum(y), &bn);
      return bn;
    }
    return z;
So most modern statically typed languages (none named, none forgotten!) actually cheat and use modular arithmetic instead. For example, the straightforward way of computing 9223372036854775807 + 1 on a 64bit system in one of those languages is likely to yield an incorrect result.

Which imho is complete bullshit because there is no reason to sacrifice correctness for performance other than to win benchmarking contests.


This overflow check is extremely cheap. Just check the overflow flag on the cpu. jo +4; jmp bignum; 6 byte. C compilers didn't support it until last year, so everyone just used inline assembly. Now we have it in gcc 5 and clang 3.4

One good trick started by lua is to use double for everything and check the mantissa of the normalized result for overflows. A double can represent the whole int32 range. With double and nan-tagging you have a unique fast representation of all primitives and can use a double word stack. Which is the biggest win. Look at how python, perl and ruby are doing this. Horrible. And then look how a fast dynamic language is doing it.


Everything is relative. :) Compared to the add instruction itself, the overflow check is very expensive. But the big problem is that your types widen. E.g:

    // x and y are hardware integers >= 0
    var z = x + y
    for (var i = 0; i < z; i++) {
        ...
    }
If your language used modular arithmetic, 'z' would also be a hardware integer and the induction variable 'i' could be kept in a cpu register. But with real arithmetic using bignums you can't assume that anymore and each iteration of the loop must do a relatively expensive comparison of 'i' against 'z'. In pseudo code:

    var z = x + y
    var i = 0
    if (!hw_int(z)) i = new_bignum(0)
    while (true) {
      // i < z
      if (hw_ints?(i, z)) {
        if (i >= z) break
      } else {        
        if (bignum_gte(i, z)) break
      }        
      ... body ...
      // i++
      if (hw_int?(i)) {
        i++;
      } else {
        i = bignum_add(i, 1)
      }
    }


No, it's super cheap. overflows happen <1%, so you mark it as UNLIKELY, and the branch prediction units just don't care about all those bignum branches.

If you know to use bigints or bignums, just type it and use their variants from the beginning.

Most better dynamic languages already use tagged ints, so they have only say ~30bits for an int and do a lot of those overflow checks and still beat the hell out of the poorly designed dynamic languages with full boxed ints or nums.


My understanding is that there is no plan to use type hints for optimization. I'm on a phone so I can't find it now, but I previous talked to some people on here about it in comments.


Please let us know if you find a reference to that - I've not seen anything either way (other than perhaps eg: cython piggybacking on the syntax for its own use - but not something strictly speaking "in (c)python". That said, according to: https://www.python.org/dev/peps/pep-0484/#rationale-and-goal...

"This PEP aims to provide a standard syntax for type annotations, opening up Python code to easier static analysis and refactoring, potential runtime type checking, and (perhaps, in some contexts) code generation utilizing type information.

Of these goals, static analysis is the most important. This includes support for off-line type checkers such as mypy, as well as providing a standard notation that can be used by IDEs for code completion and refactoring."

So optimization doesn't appear to be a non-goal - as much as not a primary goal?


I did say "no plan", not "never gonna happen." I don't think even Guido could really promise the latter.

I went back and looked, and this is the best reference I could find: https://news.ycombinator.com/item?id=9847740. It's not much more informative than the present discussion, and the best I can say is that I don't know anything about efforts to make cpython use the annotations for any changes in runtime behavior (either through code generation, or at runtime).


Optimization is only a distant goal according to Guido---I think he said so in the PEP or the dev mailing list discussion thereof.


The notion that Python spends most of its time in its C code isn't particularly insightful. That's how interpreters traditionally work. It just raises the question of why that C code is slower than other interpreters.

Personally, I don't mind Python's speed. I don't use it for runtime speed, I use it for development speed. If I need runtime speed I use C++. Lately, though, I'm getting to the point where I write Common Lisp about as quickly as Python and it typically runs 4-5x faster than Python, so I've just been using that.


Python has had some surprising costs but some well-known cases have straightforward work-arounds.

For example, at one point, the way you called a multi-dot function actually mattered (not sure if this persists in the latest Python releases). If your function is like "os.path.join", the interpreter would seem to incur a cost for each dot: once to look up "os.path" and once to look up "path.join". This meant that there was a noticeable difference between calling it directly several times, and calling it only through a fully resolved value (e.g. say "path_join = os.path.join" and call it only as "path_join" each time).

Another good option is to use the standard library’s pure-C alternatives for common purposes (containers, etc.) if they fit your needs.


Yet we got a ~5% speed advantage when I refactored bytecode to wordcode https://bugs.python.org/issue26647


Just because you're in the C runtime, doesn't mean you're doing productive work. If it isn't the bytecode VM that is slow, then why is PyPy able to be much faster in many cases.

In my experience, Python is easily 10-20x slower than a compiled language when doing computational work where you can't just call into a big C function, just like you would expect to see from any interpreter. I won't generally use it for anything data intensive.


What's expensive about the runtime is the redundant type/method-dispatch not the opcode dispatch. The runtime is constantly checking types and looking up methods in hash tables.

Gains can be made by "inter-bytecode" optimization in the same vein as inter-procedural optimization.

If you can prove more assumptions about types between the execution of sequential opcodes, you can remove more type checks and reduce the amount of code that must be run and make it faster.

E.g.:

    01: x = a + b
    02: y = x + c
If we have already computed that a and b are Python ints, then we can assume that x is a Python int. To execute line 2, we just then need to compute the type of c, thus saving time.

The larger the sequence of opcodes you are optimizing over, the better gains you can make.

Right now I think Pyston uses a traditional inline-cache method. This only works for repeated executions. The code must eagerly fill the inline cache of each opcode to get the speed I'm talking about.

Another reason Python's runtime is slow is because there is no such thing as undefined behavior and programming errors are always checked against. E.g. The sqrt() function always checks if its argument is >= 0, even though it's programming error to use it incorrectly and should never happen. This can't be fixed by a compiler project, it's a problem at the language level.

Being LLVM based, I think Pyston has its greatest potential for success as an ahead-of-time mypy compiler (with type information statically available). IMO leave the JITing to PyPy.


Why can't someone just write another compiled language with 100% Python syntax? e.g Julia. only more general purpose.


Because the libraries keep people on Python and you'd need to support all those too.

Edit to add, my understanding is that the flexible nature of Python makes it hard to swap out the python without implementing all the stuff that makes it flexible. In a way I guess pypy is the python you're talking about. And it does exist, and is faster, but doesn't support all libraries.


I think Nim (http://nim-lang.org/) is what you are looking for.


Is that a joke? cause I find it really funny..


why?


Nim's syntax is far away from that of Python its more like "clean" JavaScript.

I wasnt intending to be sarcastic. I apologize if so.


I think it's very pythonic.

Can you give examples?

Also, "clean" JavaScript can look like CoffeeScript which look very much like Python/Ruby, you can almost say that "clean" versions of languages tend to converge to something like that for some ppl


CoffeeScript is terrible. Full of implicit rules and magic. Python generally seems to be the opposite.


I don't think it's magical syntactically. Semantically, maybe?


The syntax is not 100%. From variable delaration to delacring functions


There are certain differences but I wouldn't say the syntax is "far away". It's incredibly similar to Python.


That's basically Cython. Cython normally gives you 40-100% speedup over python without making any changes to the code and can often give you 10-100x speed up with some relatively minor changes to the code.


Maybe some people use Python because of the semantics. If you just replicate the syntax would it really be Python any more?


You can't separate the (full) syntax from the semantics, so he means both.


Julia is general-purpose already.


I thought they were focused towards technical computing?


That sort of focus is expressed by the library ecosystem, not the language itself. You can write any type of program you want with Julia, you'll just be doing more work if you're building something that can't fully rely on preexisting libraries. As the library ecosystem becomes broader and more mature, this issue goes away.


Cython?


The arguments presented here are extremely dumb. Of course everyone knows already that the ops itself are slow. What not many people know is that all this could be easily optimized. javascript had the very same problems, but had good engineers to overcome the dynamic overhead. php7 just restructured their data, lua and luajit are extreme examples of small data and ops winning the cache race. look at v8, which was based on strongtalk. look at guile. look at any lisp. All these problems were already solved in the 80ies, and then again in the 90ies, and then again in the 2000ies.

python is similar to perl and ruby plagued with dumb engineering. python is arguably the worst. They should just look at lua or v8. How to slim down the ops to one word, how to unbox primitives (int, bool), how to represent numbers and strings. How to speed up method dispatch, how to inline functions, how to optimize internals, how to call functions and represent lexicals. How to win with optional types. Basically almost everything done in python, perl and ruby is extremely dumb. And it will not change. I'm still wondering how php7 could overcame the very same problems, and I believe it was the external threat from hhvm and some kind of inner revolution, which could convince them to rewrite it. I blame google who had the chance to improve it, after they hired Guido, and went nowhere. You still have the old guys around resisting any improvements. They didn't have an idea how to write a fast vm's in the old days, and they still don't know.


The semantics of the language are extremely flexible, that flexibility add a ton of "yeah but" when trying to do obvious things. "yeah but what if they monkey patched boolean?" ... If I were starting a Python VM from scratch, objects would be closed and only deopt if they were mucked with. Most Python code in the wild is written in a dynamically typed version of Java. Yes most Python is Java w/o the types. Typed inference, escape analysis and closed classes will allow for the same speed as the JVM.


Yes, but then it's not python anymore. You will get a similar speedup for a fully dynamic python with easier optimizations, as done with v8 or lua. Even without types. types and attributes (like closed classes) do help, but a proper vm design is more important.

But not within this community. It needs to be a completely separate project, as cython, mypy or pypy. pypy is pretty good already, but a bit overblown.


Agree, but Pythonic-Python, isn't Python the language, it is dynamically typed Java. It isn't considered good form to use the flexibility that is there.


> javascript had the very same problems, but had good engineers to overcome the dynamic overhead

Sure, it also had Google to funnel an insane amount of money and engineering time/skill into it. If it's so simple to do then put all your ideas into a PEP and submit it.


A PEP? python needs to be rewritten completely. cython, pypy or mypy are good starts, but their op and data layout is still suboptimal and not best practice.

And I have no interest in python as language, only in fast dynamic VMs. e.g. potion could just add a python parser and get a 200x function call speedup, just as done for ruby or perl. The problem is the library.


I think the last thing rurban needs is to jump into another VM. He's already got a few forks of Perl on GitHub[1].

1: https://github.com/rurban


Python is made faster by optimizing the C implementation of it's opcodes, not by doing any optimization at a higher level?

Really?

Sounds more like; It's more convenient for us to try to optimize cpython at the opcode level because its technically difficult to apply JIT techniques at a higher level because of the way cpython is implemented.

    Pyston's performance improvements come from speeding up the C code, not the Python code.

    When people say "why doesn't Pyston use [insert favorite JIT technique here]", my question
    is whether that technique would help speed up C code.  I think this is the most fundamental
    misconception about Python performance: we spend our energy trying to JIT C code, not Python 
    code.  

    This is also why I am not very interested in running Python on pre-existing VMs, since that 
    will only exacerbate the problem in order to fix something that isn't really broken.
...really?

All I can fathom is that this project is about trying to take the existing cpython implementation and make it faster by applying various magical hacks at a very low level, rather than trying to address any of the more difficult problems about why the cpython runtime is slow.

This is exactly the opposite approach from pypy (ie. reimplement cpython in a way which is fundamentally better); and it certainly seems to be yielding some interesting results.

...but I think I'm a little skeptical that its the only solution.

It just happens to be the solution they've decided to pursue.


Who cares? You're not using it because it's performant. You use it because it's fast to code in and to ship. If you only care about performance, write C or Java or an assembler and extend your ship date.


Obviously being down voted for being too pragmatic.


Important clarification: when the author says "C runtime", what he means is "Python runtime written in C". The C runtime is, of course, libc, libm, etc. I'm not sure why the author thinks Python's high-level IR (please don't call it a VM) is a good thing, or unoptimizable (to a better IR or native code). Perhaps he's never read about Smalltalk/Self optimization (which is eye-opening!)


For me this post isn't an argument against Mozilla implementing a similar dialogue; it's an argument for implementing one and improving it.

To really have meaningful improvements would require totally reworking how extensions deal with the DOM and requiring them to ask for different levels of permission. That would break most plugins as written and probably require significant work to implement.


imho it's about references (pointers) and the inability of the memory prefetcher to optimize memory accesses. to fix this languages need true value types.


Why is website down


It has probably been "slashdotted" (in our time: hackernewsed). :D

https://en.wikipedia.org/wiki/Slashdot_effect


Something in my server's configuration eats more and more memory, until the OOM killer decides that killing the MySQL database looks like a dandy way to reclaim memory.


Try Varnish, for stuff like this it's pretty perfect.


Too much traffic from HN, the database(probably SQL) could not make as many concurrent connections. Hence the scaling problem.


Very nice to see some circular reasoning here. Because python is slow everyone dispatches to C code. Because everyone dispatches to C code, optimising the python interpreter is not worth it.


Error establishing a database connection


At least it doesn't emit the entire stack trace along with the message!

Question: are static websites so hard to do?


Not everybody is comfortable updating, managing it if you refer to Jeckyll or similar generators.


This is a WordPress blog - it's just a matter of installing the SuperCache plugin and turning on mod_rewrite caching - then all page requests are served by apache out of static files. It even handles regeneration when there's a new comment, while continuing to serve the old static file to avoid the thundering herd problem.


Static websites take away one of the crucial pieces of blogging: Comments.


If you make the page static, and have it AJAX back to load the comments, then you get the best of both worlds.

(This is what I wrote for my blog.)


Its trivial in Jekyll to add support for a 3rd party comment service, e.g. Disqus.


Only if you are comfortable using a 3rd party comment service (or injecting 3rd party javascript in your site in the first place). I know it's common, but even with the advances in CORS and whatnot, it still (IMNHO) defeats a lot of the benefits of a static web site.

That said, I would probably prefer embedding disqus or https://muut.com/ comments to running a complex php application just for comments.


Is it?

The crucial flaw of public comments is that they can be used to address other people than the author; hence spam and trolls for instance. And what's the ratio of actually useful comments anyway? Does it justify the cost of dealing with all the rest (like, say, your blog going down because some comment DB broke which directly hurts your actual readership)?

One-way communication has its advantages, too. Those who really have something to tell you can always use email.


http://ikiwiki.info/ supports comments.

There are several variants of "static website":

* Site is generated by its author, and won't change until author changes and re-generates it.

* Site generates content when something changes (e.g., new comment), not on each page view. (Cf. ikiwiki in cgi mode.)


To this day we still have wordpress blog without caching plugin ?


Why is database slow


Why is website architecture slow


It's very likely that this blog runs on a small VM (eg. the smallest DigitalOcean one) which not enough RAM. This results in mysql crashing. You could either setup a cache or add some swap as backup...


I thought that was the joke!


Any script language is slow


Not really. LuaJIT and V8 do pretty well on benchmarks. Maybe not "fast", but they certainly aren't slow.


'script language' isn't a technical definition.

Do you mean 'interpreted'? 'dynamic'? Even those terms need to be carefully qualified and counter-examples exist in each case.


Only the old dumb ones (python, perl, ruby, formerly also php). Normal dynamic script languages are written by engineers and are therefore very fast.


Are you implying that people like Larry Wall, Yukihiro Matsumoto and Guido van Rossum, all of who have advanced degrees in computer science are not (good?) engineers?


Well, to be fair, the original Ruby interpret was a very naive AST-walking interpreter. Not that Matz isn't a good engineer; I think he probably just wanted to experiment. (Take a look at the Ruby parser sometime if you need confirmation that Matz is both brilliant and possibly insane.)

Perl's internals are a little weird but are really fast at text processing. I wouldn't necessarily hold Perl up as a good piece of software engineering, but it's still much better than what most people could write.

(I'm agreeing with you though. Your parent's comment is absurd.)


They might be good hackers, but have no idea how a VM should be implemented. That's why you got this mess.

Just look how they designed the ops, the primitives, the calling convention, the stack, the hash tables. This is not engineering, this was amateurish compared to existing practice.


Can you give some examples of more professional VMs which support dynamic languages without an extra compilation step apart from the JVM?


I gave already and I didn't call them more professional. I called them engineered, in opposite to those simple adhoc implementations. The only part why worse is better in this regard was because of getting business attraction.

lua, v8, guile (but just last year), smalltalk and family (esp. strongtalk and the other pre-v8 projects), ml and family (MLton and ocaml), haskell (after many years of being too slow), any better scheme (>20), any better lisp (>5), php7 (just last year), and then the tiny lua based ones, like potion, wren, tinyrb, tinypy or tvmjit (with simple method ast->jit), or partially io, lily, or the other post-smalltalk languages, like maru (simpliest ast->jit) or the soda family around alan key and ian piumarta.

when you design a 20-100x slower calling convention without proper support for lexical variables, slow data, no unboxing, no tagging, slow ops, no inlining, no fast memory reclamation you are in the pre-SICP days and should be treated like that.

not even talking about the modern static languages and type systems which are now catching up, via llvm. not go or rust, but stuff like pony or julia.


Well, C/Fortran/C++/ASM are a PITA when it comes to dynamic structures like the one for handling configurations.

Plus authors missed that boxing in python has a tendency to fragment data in a non controlable way in the memory thus making the use of L1/L2/L3/mem (on x86) or other memory architecture with similar layout very hard to use.

If you code often you know the 80/20 rule: 80% of your code is the setup preparing for the 20% of heavy lifting.

Numpy (which relies on Fortran) is a nice Proof that when done correctly python is really useful.

A computing a Moving average in pure python is 10 000 times slower than using numpy (especially if you use FFT).

So ... I am saying since python (like Tcl or Perl) is a good language for doing FFI (foreign function integration) it should be used this way.

And thanks to the often unfairly hated GIL it enables to use un-thread safe foreign language library in a thread safe way.

All being said and done, if python used this way is slow, I dare say it is because some coders do not understand how to build there data/execution flow. And this is not language dependent, but a question of coder.

I thought they were 10x coders in the past.

I recently realized there are /10 coders in fact.

The one that are pissed at coding when «it does not work the way it should» and expect language to be magically doing most of the job without learning.

The 1x coders on the other hands are boring, slow coders and accept that when the «stuff» is not behaving the right way, it may not be the stuff that is not working, but him/her that have a misconception.

1x coder is not a state, it is a trajectory that can degrade or improve with new challenge and poor/good state of mind, hence the misconception on 10x.


It is possible to make a dynamic language natively compiled and very fast. Common Lisp is arguably even more dynamic than Python, and SBCL manages to generate code that rivals C in performance.


I don't consider Common Lisp “more dynamic” than Python; at least in that you modify code at runtime and such. Frequently Common Lisp code is not particularly dynamic, because you can get the same level of expressiveness without relying on mucking around with magic at runtime.

Also SBCL only generates really fast code if you use a lot of type hints and (optimize (speed 3) (safety 0)). That said, when you do, it's really fast.


Typically Common Lisp and its implementations provides a wide range of performance options. It's not limited to the model of some more primitive language runtimes, which provide only two modes (simplified): slow if implemented in the language, fast if it runs mostly in its runtime and library functions written in C. There are Common Lisp implementations which follow this model, too. But there are also many which have sophisticated runtimes with optimizing native code compilers.

To get to really fast code in Common Lisp one needs to write relatively low-level code with type declarations, optimization hints, low-level operators etc.

There is a part of Common Lisp which is as dynamic (more?) as Python: everything written in CLOS (multi-dispatch, multi-methods, method-combinations, ...) and expecially when using the CLOS MOP. Is it slower or faster than comparable Python code? I have no idea and I haven't seen any interesting benchmarks. The amount of CLOS use depends on the implementations. Some implementations have a lot of their library code and also much of the language itself (everything IO, error handling, ...) written using CLOS.


I have written plenty of heavily CLOS-based code that is still very fast. I doubt that Python could come close in performance.

I think it would be an interesting exercise to do some benchmarks to get some actual data behind both of our guesses.

As I am not a very experienced Python developer, would you be interested in writing some test code that is representative of typical Python code, and I'd be happy to port it over to CLOS and so that we can run some benchmarks?

I think such results would be of interest to the HN audience.


Well, whatever the problem is, setting up the processing, parsing input files, checking inputs might not need to be fast but much more secure and user friendly.

This -at my experience- often requires very «dynamic» structures in the form of intricated hash table/dict that requires to be correctly set, malloced, modified. JSON like stuff.

You may do it safely in python/perl/php/tcl way more than handling your memory in C/C++/ASM. It is easier, more error prone, and faster. And no real speed is required.

Sometimes what we search is the right tool for the right problem.

Setting up a software is often a big part of the code that consume a lot of lines of code, while often a small part of the execution time. And the priority is often on correctly handling/checking the input without buffer overflow/undefined behaviour much more than speed.

So I totally think that the idea of using the right tools for the jobs is a good idea. Imagining one language can do all, is pure religion.

I wouldn't even be surprised that one day we have a description language used only for setting up and distributing messages à la VHDL instead of trying to "integrate it" either in the OS with "dynamic buses" or with "runtimes".




Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | DMCA | Apply to YC | Contact

Search: